Summary #
- ROUGE is a family of metrics that measure n-gram or LCS overlap between generated and reference texts.
- Using a summarization example, we calculate ROUGE-1/2/L and interpret their meaning.
- We also discuss its correlation with human judgment and practical issues in long-text evaluation.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a family of metrics that measure the degree of overlap between a candidate summary and a reference summary.
ROUGE-1 and ROUGE-2 compute unigram and bigram recall, while ROUGE-L is based on the longest common subsequence (LCS) between the two texts.
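To make the n-gram computation concrete, here is a minimal sketch of ROUGE-N recall (illustration only; the function name and example sentences are ours, and real implementations also report precision and F1):

from collections import Counter

def rouge_n_recall(candidate: str, reference: str, n: int = 1) -> float:
    # Toy ROUGE-N recall: clipped overlapping n-grams / n-grams in the reference.
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # Counter intersection clips repeated n-grams
    return overlap / max(sum(ref.values()), 1)

print(rouge_n_recall("Today the weather is nice", "Today the weather is very nice"))
# five of the six reference unigrams match -> 0.833...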
Main Variants #
- ROUGE-1 / ROUGE-2: Measure unigram and bigram recall — useful for extractive summarization and checking key-term coverage.
- ROUGE-L: Based on the length of the longest common subsequence (LCS) normalized by reference length, so word order is taken into account (see the sketch after this list).
- ROUGE-Lsum: Computes LCS sentence by sentence and aggregates the results; often used for long-document summarization.
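To see what ROUGE-L measures, here is a minimal sketch of LCS-based recall (again illustration only; the function names and example sentences are ours):

def lcs_length(a: list, b: list) -> int:
    # Classic dynamic-programming longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_recall(candidate: str, reference: str) -> float:
    cand, ref = candidate.lower().split(), reference.lower().split()
    return lcs_length(cand, ref) / max(len(ref), 1)

print(rouge_l_recall("Today the weather is nice", "The weather is nice today"))
# LCS is "the weather is nice" (length 4) over 5 reference tokens -> 0.8

In practice, the rouge_score package handles tokenization, stemming, and precision/recall/F1 reporting; the snippet below scores a candidate summary against a reference: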
from rouge_score import rouge_scorer

# Request several variants at once; use_stemmer=True applies Porter stemming
# so that, for example, "walked" and "walking" match.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeLsum"], use_stemmer=True)

candidate = "Today is a very good weather day"
reference = "Today the weather is nice"

# Note the argument order: score(target, prediction), i.e. reference first.
print(scorer.score(reference, candidate))
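scorer.score() returns a dict mapping each variant name to a Score tuple with precision, recall, and fmeasure fields; for example, the rouge1 entry's recall field corresponds to the unigram recall computed by hand above.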
Advantages #
- Lightweight to compute; multiple sub-metrics can be calculated at once.
- Correlates reasonably well with human judgments of content coverage (recall).
- Standard metric for extractive summarization and information-content comparison.
Limitations #
- Does not account for semantic similarity — paraphrases can unfairly reduce the score.
- With only one reference summary, diversity in candidate outputs may be undervalued.
- For Japanese and other languages without whitespace word boundaries, apply morphological tokenization or subword splitting first to stabilize scores (rouge_score's default tokenizer keeps only alphanumeric tokens); see the sketch below.
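As one possible workaround, the sketch below plugs a custom tokenizer into RougeScorer. It assumes the third-party janome morphological analyzer is installed (pip install janome) and a rouge_score version whose RougeScorer accepts a tokenizer argument; any object with a tokenize(text) method should work.

from janome.tokenizer import Tokenizer
from rouge_score import rouge_scorer

class JanomeTokenizer:
    # Adapter exposing the tokenize(text) interface RougeScorer expects.
    def __init__(self):
        self._tokenizer = Tokenizer()

    def tokenize(self, text):
        # wakati=True yields surface forms only, i.e. plain word segmentation.
        return list(self._tokenizer.tokenize(text, wakati=True))

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], tokenizer=JanomeTokenizer())

reference = "今日はとても良い天気です"  # "The weather is very nice today"
candidate = "今日は良い天気です"        # "The weather is nice today"
print(scorer.score(reference, candidate))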