4.6.2
ROUGE Metrics | Evaluating text summarization with n-gram overlap
- ROUGE is a family of metrics that measure n-gram or LCS overlap between generated and reference texts.
- Using a summarization example, we calculate ROUGE-1/2/L and interpret their meaning.
- We also discuss its correlation with human judgment and practical issues in long-text evaluation.
- Prerequisite: BLEU Score — understanding this concept first will make learning smoother
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a family of metrics that measure the degree of overlap between a candidate summary and a reference summary.
ROUGE-1 and ROUGE-2 compute n-gram recall, while ROUGE-L is based on the Longest Common Subsequence (LCS) between the two texts.
Main Variants #
- ROUGE-1 / ROUGE-2: Measure unigram and bigram recall — useful for extractive summarization and checking key-term coverage.
- ROUGE-L: Based on the length of the longest common subsequence normalized by reference length — considers word order.
- ROUGE-Lsum: Averages LCS scores at the sentence level — often used for long-document summarization.
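The n-gram variants above reduce to a few lines of counting. Here is a minimal sketch (toy whitespace tokenization; real implementations add options such as stemming):

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset of n-grams from a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n):
    """Fraction of reference n-grams that also appear in the candidate."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    if not ref:
        return 0.0
    overlap = sum((cand & ref).values())  # clipped matches: each n-gram counted at most as often as in the reference
    return overlap / sum(ref.values())

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge_n_recall(candidate, reference, 1))  # 5 of 6 reference unigrams matched
print(rouge_n_recall(candidate, reference, 2))  # 3 of 5 reference bigrams matched
```

Note how the single substituted word ("lay" for "sat") costs one unigram but two bigrams, which is why ROUGE-2 reacts more strongly to local edits.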
Advantages #
- Lightweight to compute; multiple sub-metrics can be calculated at once.
- Shows relatively strong correlation with human recall-based evaluation.
- Standard metric for extractive summarization and information-content comparison.
Limitations #
- Does not account for semantic similarity — paraphrases can unfairly reduce the score.
- With only one reference summary, diversity in candidate outputs may be undervalued.
- For Japanese and other unsegmented or morphologically rich languages, raw scores are unstable; apply tokenization or subword splitting before scoring.
FAQ #
What is ROUGE-Lsum? #
ROUGE-Lsum (ROUGE Longest Common Subsequence for summarization) computes the LCS score sentence by sentence and then averages the results, rather than treating the entire text as one sequence (as ROUGE-L does). This makes it better suited for multi-sentence or long-document summaries.
Specifically, ROUGE-Lsum:
- Splits the candidate and reference into individual sentences.
- Computes LCS-based matches between each reference sentence and the candidate's sentences (a union LCS in the standard formulation).
- Aggregates the sentence-level match statistics into precision, recall, and F1.
It is the preferred variant for evaluating abstractive summarization models (e.g., BART, T5, GPT-based summarizers) on benchmarks like CNN/DailyMail and XSum.
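The steps above can be sketched as follows. This is a simplification that pairs sentences by position and averages per-sentence LCS F1; library implementations aggregate union-LCS match counts instead, so exact values will differ from library output:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists (DP table)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def lcs_f1(cand, ref):
    """Sentence-level LCS F1: precision vs candidate length, recall vs reference length."""
    cand, ref = cand.split(), ref.split()
    l = lcs_len(cand, ref)
    if l == 0:
        return 0.0
    p, r = l / len(cand), l / len(ref)
    return 2 * p * r / (p + r)

def rouge_lsum_sketch(candidate, reference):
    """Average sentence-level LCS F1; sentences are separated by newlines."""
    pairs = list(zip(candidate.split("\n"), reference.split("\n")))
    return sum(lcs_f1(c, r) for c, r in pairs) / len(pairs)

cand = "the cat sat\nit was happy"
ref = "the cat sat down\nit was very happy"
print(rouge_lsum_sketch(cand, ref))
```

Note the newline convention: most implementations treat `\n` as the sentence boundary, which is why candidate and reference texts must be sentence-split before scoring with ROUGE-Lsum.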
What is the difference between ROUGE-L and ROUGE-Lsum? #
| | ROUGE-L | ROUGE-Lsum |
|---|---|---|
| Granularity | Full text as one sequence | Sentence-by-sentence |
| Sentence boundaries | Ignored | Respected |
| Better for | Short summaries | Long, multi-sentence summaries |
| Common benchmark use | General NLP evaluation | Abstractive summarization |
Use ROUGE-Lsum when evaluating paragraph-level or document-level summaries. Use ROUGE-L for sentence-level generation tasks.
What do ROUGE-1 and ROUGE-2 measure? #
- ROUGE-1: overlap of individual words (unigrams) between candidate and reference. Reflects vocabulary coverage.
- ROUGE-2: overlap of adjacent word pairs (bigrams). Rewards capturing phrases and local word order, making it more sensitive to fluency.
ROUGE-2 is often used as the primary metric in news summarization benchmarks because bigram overlap correlates better with human judgments than unigram overlap alone.
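The order sensitivity is easy to demonstrate: a candidate containing the right words in the wrong order keeps a perfect ROUGE-1 but loses all bigram credit (toy whitespace tokenization for illustration):

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset of n-grams from a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n):
    """Fraction of reference n-grams that also appear in the candidate."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())
    return overlap / sum(ref.values()) if ref else 0.0

reference = "the quick brown fox jumps"
shuffled = "fox brown the jumps quick"   # same words, scrambled order

print(rouge_n_recall(shuffled, reference, 1))  # 1.0: every unigram is present
print(rouge_n_recall(shuffled, reference, 2))  # 0.0: no adjacent pair survives
```

This is why ROUGE-1 alone can be gamed by keyword stuffing, while ROUGE-2 (and ROUGE-L) require some local structure.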
How is ROUGE scored: recall, precision, or F1? #
ROUGE can report all three, but recall is the traditional focus (the name stands for “Recall-Oriented Understudy”). Recall measures how much of the reference content appears in the candidate. In practice, the F1 score (harmonic mean of precision and recall) is now most commonly reported to balance coverage and conciseness.
The rouge_score library returns all three:
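The arithmetic behind those three numbers can be sketched for ROUGE-1 in pure Python (with the `rouge_score` package, the equivalent call is `rouge_scorer.RougeScorer(['rouge1']).score(reference, candidate)`, which returns `Score(precision, recall, fmeasure)` tuples):

```python
from collections import Counter

def rouge1_scores(candidate, reference):
    """ROUGE-1 precision, recall, and F1 from clipped unigram overlap."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)   # matched / candidate length
    recall = overlap / max(sum(ref.values()), 1)       # matched / reference length
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

# A slightly verbose candidate: recall stays perfect, precision drops.
p, r, f = rouge1_scores("the cat sat on the mat today", "the cat sat on the mat")
print(p, r, f)
```

A longer, padded candidate keeps recall at 1.0 while precision falls, and F1 splits the difference — exactly the coverage-versus-conciseness trade-off described above.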
When should I use BERTScore instead of ROUGE? #
ROUGE is purely lexical — it only counts exact word matches. It penalizes valid paraphrases that use different vocabulary. BERTScore uses contextual embeddings to measure semantic similarity, rewarding paraphrases that preserve meaning even with different words.
Use ROUGE as a fast, reproducible baseline for extractive summarization or when comparing systems on standard benchmarks. Use BERTScore (or other embedding-based metrics) when semantic fidelity matters more than exact word overlap, such as for abstractive generation or multilingual evaluation.
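The paraphrase penalty is easy to reproduce with a toy example (exact-match unigram recall only; an embedding-based metric would treat "film"/"movie" and "excellent"/"great" as near-synonyms):

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    """Fraction of reference unigrams found verbatim in the candidate."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    return sum((cand & ref).values()) / sum(ref.values())

reference = "the film was excellent"
paraphrase = "the movie was great"       # same meaning, different content words

print(rouge1_recall(paraphrase, reference))  # 0.5: only "the" and "was" match
```

The paraphrase preserves the meaning yet scores only 0.5, because ROUGE has no notion of synonymy — the failure mode that motivates BERTScore and other embedding-based metrics.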