BLEU Metric

Overview
  • BLEU measures translation quality using n-gram overlap between candidate and reference sentences.
  • We implement BLEU with n-gram precision and brevity penalty, and check how the score behaves.
  • We discuss its sensitivity to word order and synonyms, and how multiple references can mitigate this.

1. Concept of BLEU #

  1. Compute modified precision for 1-gram to n-gram (typically up to 4-gram) between candidate and reference sentences.
  2. Average the logarithms of these precisions with uniform weights and exponentiate, i.e. take their geometric mean.
  3. Multiply by a brevity penalty when the candidate is shorter than the closest reference, so that overly short candidates do not receive inflated scores.

BLEU ranges from 0 to 1 — higher values indicate translations closer to the reference.
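
Written out, this is the standard BLEU formulation, where p_n is the modified n-gram precision, w_n = 1/N are uniform weights (N = 4 by default), c is the candidate length, and r is the length of the closest reference:

\text{BLEU} = \text{BP} \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right),
\qquad
\text{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}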


2. Implementation Example in Python 3.13 #

A pure standard-library implementation of BLEU, applied to a sample candidate and references.

from __future__ import annotations

import math
from collections import Counter
from collections.abc import Iterable, Sequence


def ngram_counts(tokens: Sequence[str], n: int) -> Counter[tuple[str, ...]]:
    """Count n-gram frequencies from a sequence of tokens."""
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(
    candidate: Sequence[str],
    references: Iterable[Sequence[str]],
    n: int,
) -> tuple[int, int]:
    """Return (matching n-gram count, total n-gram count) for candidate vs references."""
    cand_counts = ngram_counts(candidate, n)
    max_ref: Counter[tuple[str, ...]] = Counter()
    for ref in references:
        # Counter union (|=) keeps the element-wise maximum count across references.
        max_ref |= ngram_counts(ref, n)
    # Clip each candidate n-gram count to its highest count in any single reference.
    overlap = {ng: min(count, max_ref[ng]) for ng, count in cand_counts.items()}
    # max(1, ...) guards against division by zero for empty candidates.
    return sum(overlap.values()), max(1, sum(cand_counts.values()))


def brevity_penalty(candidate_len: int, reference_lens: Iterable[int]) -> float:
    """Compute brevity penalty when candidate is too short."""
    if candidate_len == 0:
        return 0.0
    # Pick the reference length closest to the candidate; ties favor the shorter reference.
    closest_ref_len = min(reference_lens, key=lambda r: (abs(r - candidate_len), r))
    ratio = candidate_len / closest_ref_len
    if ratio > 1:
        return 1.0
    return math.exp(1 - 1 / ratio)


def bleu(candidate: str, references: Sequence[str], max_n: int = 4) -> float:
    """Compute BLEU score from candidate and reference sentences."""
    candidate_tokens = candidate.split()
    reference_tokens = [ref.split() for ref in references]
    precisions: list[float] = []
    for n_value in range(1, max_n + 1):
        overlap, total = modified_precision(candidate_tokens, reference_tokens, n_value)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        # Without smoothing, a single zero precision forces the geometric mean to zero.
        return 0.0
    # Geometric mean of the n-gram precisions with uniform weights 1/max_n.
    geometric_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    penalty = brevity_penalty(len(candidate_tokens), (len(ref) for ref in reference_tokens))
    return penalty * geometric_mean


if __name__ == "__main__":
    candidate_sentence = "the cat is on the mat"
    reference_sentences = [
        "there is a cat on the mat",
        "the cat sits on the mat",
    ]
    score = bleu(candidate_sentence, reference_sentences)
    print(f"BLEU = {score:.3f}")

Output example:

BLEU = 0.000

With these sentences the unsmoothed score is 0: none of the candidate's 4-grams appears in either reference, so the 4-gram precision is zero and the geometric mean collapses. This is a well-known behavior of BLEU on short sentences and the usual motivation for smoothing.
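
To also see the brevity penalty from step 3 in action, here is a small follow-up check reusing the bleu function above. The sentences are invented for illustration; the values in the comments are roughly what this implementation should print.

long_reference = ["the big fluffy cat sat on the mat today"]
short_candidate = "the cat sat on the mat"
# Most candidate n-grams appear in the reference, but the candidate is shorter
# (6 tokens vs 9), so the brevity penalty exp(1 - 9/6) ≈ 0.61 scales the score
# down to roughly 0.48.
print(f"BLEU = {bleu(short_candidate, long_reference):.3f}")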

3. Advantages #

  • Easy and fast to implement; widely used as a benchmark in machine translation.
  • Multiple reference sentences improve robustness to paraphrasing, as the sketch below illustrates.
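
As a quick sketch of this effect, reusing the bleu function above (the sentences are invented for illustration; with them the unsmoothed score should rise from 0.000 with a single paraphrased reference to roughly 0.809 once a second, closer-worded reference is added):

candidate = "the cat sat on the mat quietly"
paraphrase_only = ["a cat was sitting on the mat"]
with_second_ref = ["a cat was sitting on the mat", "the cat sat on the mat"]
# With only the paraphrase, no 4-gram matches, so the unsmoothed score is 0.
print(f"one reference:  {bleu(candidate, paraphrase_only):.3f}")
# The added reference covers the candidate's wording, so the clipped counts and
# the score increase (to about 0.809 here).
print(f"two references: {bleu(candidate, with_second_ref):.3f}")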

4. Limitations #

  • Sensitive to word order and synonyms — semantically correct translations may score low.
  • Correlation with human judgment weakens for long-form outputs such as summaries.
  • For languages without whitespace word boundaries, such as Japanese, tokenize or segment the text before computing BLEU (see the sketch after this list).
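
The bleu function above assumes whitespace-separated tokens, which does not hold for Japanese. A minimal, dependency-free workaround is character-level splitting, sketched below with invented sentences; in practice a morphological analyzer such as MeCab or Sudachi would be used for proper word segmentation.

# Whitespace splitting would treat each Japanese sentence as a single token,
# so split into characters before reusing bleu() (a crude stand-in for real
# word segmentation).
candidate_ja = "猫がマットの上にいる"
reference_ja = "猫はマットの上にいます"
char_candidate = " ".join(candidate_ja)   # "猫 が マ ッ ト の 上 に い る"
char_reference = " ".join(reference_ja)
print(f"character-level BLEU = {bleu(char_candidate, [char_reference]):.3f}")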

Summary #

  • BLEU estimates translation quality based on n-gram overlap and brevity penalty.
  • Can be implemented in pure Python 3.13 with type hints for reusability.
  • Combine with metrics like ROUGE or METEOR to capture lexical diversity and semantic similarity.