4.6.2 ROUGE Metrics
Summary
- ROUGE is a family of metrics that measure n-gram or LCS overlap between generated and reference texts.
- Using a summarization example, we calculate ROUGE-1/2/L and interpret their meaning.
- We also discuss its correlation with human judgment and practical issues in long-text evaluation.
- Prerequisite: BLEU Score. Understanding that metric first will make this section easier to follow.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a family of metrics that measure the degree of overlap between a candidate summary and a reference summary.
ROUGE-1 and ROUGE-2 compute n-gram recall, while ROUGE-L is based on the Longest Common Subsequence (LCS) between the two texts.
Main Variants
- ROUGE-1 / ROUGE-2: Measure unigram and bigram recall, respectively; useful for extractive summarization and checking key-term coverage.
- ROUGE-L: Based on the length of the longest common subsequence normalized by reference length; unlike n-gram overlap, it rewards preserved word order without requiring contiguous matches.
- ROUGE-Lsum: Computes LCS scores at the sentence level and aggregates them; often used for long-document summarization.
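The recall-oriented scores above can be computed from scratch. Below is a minimal sketch in Python; the whitespace tokenization and the example sentence pair are assumptions for illustration, not part of any official implementation:

```python
from collections import Counter

def ngrams(tokens, n):
    # multiset of n-grams, so repeated grams are counted with clipping
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n):
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

def lcs_length(a, b):
    # classic dynamic-programming longest common subsequence
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_recall(candidate, reference):
    ref = reference.split()
    return lcs_length(candidate.split(), ref) / len(ref)

ref = "the cat sat on the mat"
cand = "the cat is on the mat"
print(round(rouge_n_recall(cand, ref, 1), 3))  # 0.833 (5 of 6 reference unigrams covered)
print(round(rouge_n_recall(cand, ref, 2), 3))  # 0.6   (3 of 5 reference bigrams covered)
print(round(rouge_l_recall(cand, ref), 3))     # 0.833 (LCS "the cat on the mat" = 5 tokens)
```

Note that ROUGE-2 drops more sharply than ROUGE-1 here: a single substituted word breaks two bigrams, which is why ROUGE-2 is the stricter signal of local fluency.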
Advantages
- Lightweight to compute; multiple sub-metrics can be calculated at once.
- Shows relatively strong correlation with human recall-based evaluation.
- Standard metric for extractive summarization and information-content comparison.
Limitations
- Does not account for semantic similarity; paraphrases that preserve meaning can unfairly lower the score.
- With only one reference summary, diversity in candidate outputs may be undervalued.
- ROUGE assumes whitespace-delimited tokens; for Japanese and other languages without explicit word boundaries, apply morphological tokenization or subword splitting first to stabilize scores.
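One common workaround for unsegmented text such as Japanese is to score character n-grams instead of words, which requires no morphological analyzer at all. A minimal sketch (the example sentence pair is an illustrative assumption):

```python
from collections import Counter

def char_ngram_recall(candidate, reference, n=2):
    # character n-grams avoid needing a word segmenter such as MeCab
    ref = Counter(reference[i:i + n] for i in range(len(reference) - n + 1))
    cand = Counter(candidate[i:i + n] for i in range(len(candidate) - n + 1))
    overlap = sum(min(count, cand[g]) for g, count in ref.items())
    return overlap / sum(ref.values())

# roughly "The cat sat on the mat." vs. "The cat is on the mat."
reference = "猫がマットの上に座った"
candidate = "猫はマットの上にいる"
print(round(char_ngram_recall(candidate, reference), 2))  # 0.5
```

Character bigrams make scores reproducible across tokenizers, at the cost of some interpretability compared with word-level matches.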