Overview #
- BLEU measures translation quality using n-gram overlap between candidate and reference sentences.
- We implement BLEU with n-gram precision and brevity penalty, and check how the score behaves.
- We discuss its sensitivity to word order and synonyms, and how multiple references can mitigate this.
1. Concept of BLEU #
- Compute modified precision for 1-gram to n-gram (typically up to 4-gram) between candidate and reference sentences.
- Average the logarithms of these precisions and exponentiate the result, i.e. take their geometric mean.
- Apply a brevity penalty when the candidate is shorter than the reference, so that overly short outputs cannot score well on precision alone.
BLEU ranges from 0 to 1 — higher values indicate translations closer to the reference.
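With uniform weights $w_n = 1/N$ (here $N = 4$), the standard formulation from Papineni et al. (2002) is

$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\Big(\sum_{n=1}^{N} w_n \log p_n\Big), \qquad \mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}$$

where $p_n$ is the modified $n$-gram precision, $c$ is the candidate length, and $r$ is the reference length closest to $c$.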
2. Implementation Example in Python 3.13 #
Below is a pure standard-library implementation of BLEU, applied to a sample candidate and references.
from __future__ import annotations
import math
from collections import Counter
from collections.abc import Iterable, Sequence
def ngram_counts(tokens: Sequence[str], n: int) -> Counter[tuple[str, ...]]:
    """Count n-gram frequencies from a sequence of tokens."""
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))
def modified_precision(
    candidate: Sequence[str],
    references: Iterable[Sequence[str]],
    n: int,
) -> tuple[int, int]:
    """Return (matching n-gram count, total n-gram count) for candidate vs references."""
    cand_counts = ngram_counts(candidate, n)
    # Counter union (|=) keeps the maximum count of each n-gram across references.
    max_ref: Counter[tuple[str, ...]] = Counter()
    for ref in references:
        max_ref |= ngram_counts(ref, n)
    # Clip each candidate n-gram count to its maximum count in any reference.
    overlap = {ng: min(count, max_ref[ng]) for ng, count in cand_counts.items()}
    return sum(overlap.values()), max(1, sum(cand_counts.values()))
def brevity_penalty(candidate_len: int, reference_lens: Iterable[int]) -> float:
    """Compute brevity penalty when candidate is too short."""
    if candidate_len == 0:
        return 0.0
    # Use the reference length closest to the candidate; ties go to the shorter one.
    closest_ref_len = min(reference_lens, key=lambda r: (abs(r - candidate_len), r))
    ratio = candidate_len / closest_ref_len
    if ratio > 1:
        return 1.0
    # exp(1 - r/c), expressed in terms of ratio = c/r.
    return math.exp(1 - 1 / ratio)
def bleu(candidate: str, references: Sequence[str], max_n: int = 4) -> float:
    """Compute BLEU score from candidate and reference sentences."""
    candidate_tokens = candidate.split()
    reference_tokens = [ref.split() for ref in references]
    precisions: list[float] = []
    for n_value in range(1, max_n + 1):
        overlap, total = modified_precision(candidate_tokens, reference_tokens, n_value)
        precisions.append(overlap / total)
    # Any zero precision drives the (unsmoothed) geometric mean to zero.
    if min(precisions) == 0:
        return 0.0
    geometric_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    penalty = brevity_penalty(len(candidate_tokens), (len(ref) for ref in reference_tokens))
    return penalty * geometric_mean
if __name__ == "__main__":
    candidate_sentence = "the cat is on the mat"
    reference_sentences = [
        "there is a cat on the mat",
        "the cat sits on the mat",
    ]
    score = bleu(candidate_sentence, reference_sentences)
    print(f"BLEU = {score:.3f}")
Output:
BLEU = 0.000
The candidate shares no 4-gram with either reference, so the 4-gram precision is zero and the unsmoothed score collapses to 0.000. This is a common outcome for sentence-level BLEU and the usual motivation for smoothing.
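One simple remedy, shown here as an illustrative sketch rather than a standard recipe, is add-one smoothing of the n-gram counts (bleu_add_one is a name introduced here; it reuses the helpers defined above):

def bleu_add_one(candidate: str, references: Sequence[str], max_n: int = 4) -> float:
    """BLEU with add-one smoothing so a single zero precision cannot null the score."""
    candidate_tokens = candidate.split()
    reference_tokens = [ref.split() for ref in references]
    precisions: list[float] = []
    for n_value in range(1, max_n + 1):
        overlap, total = modified_precision(candidate_tokens, reference_tokens, n_value)
        precisions.append((overlap + 1) / (total + 1))  # add-one smoothing
    geometric_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    penalty = brevity_penalty(len(candidate_tokens), (len(ref) for ref in reference_tokens))
    return penalty * geometric_mean

For the example sentences above this yields roughly 0.51 instead of 0.000.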
3. Advantages #
- Easy and fast to implement; widely used as a benchmark in machine translation.
- Multiple reference sentences improve robustness to paraphrasing (see the quick check below).
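A quick check with the bleu function above illustrates this: a candidate that paraphrases the first reference scores zero against it alone, but perfectly once a matching second reference is available.

candidate = "the cat sits on the mat"
print(bleu(candidate, ["there is a cat on the mat"]))  # 0.000: no 4-gram shared with the single reference
print(bleu(candidate, ["there is a cat on the mat",
                       "the cat sits on the mat"]))    # 1.000: identical to the added reference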
4. Limitations #
- Sensitive to word order and synonyms — semantically correct translations may score low.
- Correlation with human judgment drops for long summaries.
- For languages without whitespace word boundaries, such as Japanese, tokenize or segment the text before computing the score (see the character-level fallback sketched below).
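If no morphological analyzer (e.g. MeCab or SudachiPy) is at hand, a crude fallback, sketched here under that assumption, is to treat every character as a token:

# Character-level tokenization as a rough stand-in for word segmentation.
candidate_ja = "猫がマットの上にいる"
reference_ja = "猫はマットの上にいます"
# " ".join(text) inserts a space between characters so that bleu() can split them.
print(bleu(" ".join(candidate_ja), [" ".join(reference_ja)]))  # roughly 0.60 for this pair

Character n-grams are a weak proxy for word n-grams, but they avoid scoring everything as zero on unsegmented text.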
Summary #
- BLEU estimates translation quality based on n-gram overlap and brevity penalty.
- It can be implemented in pure Python 3.13, with type hints that keep the helpers reusable.
- Combine BLEU with metrics like ROUGE or METEOR to capture lexical diversity and semantic similarity (a minimal ROUGE-1 sketch follows).
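As a pointer in that direction, here is a minimal ROUGE-N recall sketch reusing ngram_counts from above (rouge_n_recall is a name introduced here; established implementations should be preferred in practice):

def rouge_n_recall(candidate: str, reference: str, n: int = 1) -> float:
    """Fraction of reference n-grams also present in the candidate (recall-oriented)."""
    cand_counts = ngram_counts(candidate.split(), n)
    ref_counts = ngram_counts(reference.split(), n)
    overlap = sum(min(count, cand_counts[ng]) for ng, count in ref_counts.items())
    return overlap / max(1, sum(ref_counts.values()))

Where BLEU asks how much of the candidate is supported by the references, ROUGE-N recall asks how much of the reference is covered, which is why the two are often reported together.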