4.6.2
ROUGE Metrics | Evaluating text summarization with n-gram overlap
- ROUGE is a family of metrics that measure n-gram or LCS overlap between generated and reference texts.
- Using a summarization example, we calculate ROUGE-1/2/L and interpret their meaning.
- We also discuss its correlation with human judgment and practical issues in long-text evaluation.
- Prerequisite: BLEU Score — understanding this concept first will make learning smoother
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a family of metrics that measure the degree of overlap between a candidate summary and a reference summary.
ROUGE-1 and ROUGE-2 compute n-gram recall, while ROUGE-L is based on the Longest Common Subsequence (LCS) between the two texts.
Main Variants #
- ROUGE-1 / ROUGE-2: Measure unigram and bigram recall — useful for extractive summarization and checking key-term coverage.
- ROUGE-L: Based on the length of the longest common subsequence normalized by reference length — considers word order.
- ROUGE-Lsum: Averages LCS scores at the sentence level — often used for long-document summarization.
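The n-gram variants above reduce to a few lines of counting. Here is a minimal sketch (toy whitespace tokenization; real implementations add options such as stemming):

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset of n-grams from a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n):
    """Fraction of reference n-grams that also appear in the candidate."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    if not ref:
        return 0.0
    overlap = sum((cand & ref).values())  # clipped matches: each n-gram counted at most as often as in the reference
    return overlap / sum(ref.values())

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge_n_recall(candidate, reference, 1))  # 5 of 6 reference unigrams matched
print(rouge_n_recall(candidate, reference, 2))  # 3 of 5 reference bigrams matched
```

Note how the single substituted word ("lay" for "sat") costs one unigram but two bigrams, which is why ROUGE-2 reacts more strongly to local edits.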
Advantages #
- Lightweight to compute; multiple sub-metrics can be calculated at once.
- Shows relatively strong correlation with human recall-based evaluation.
- Standard metric for extractive summarization and information-content comparison.
Limitations #
- Does not account for semantic similarity — paraphrases can unfairly reduce the score.
- With only one reference summary, diversity in candidate outputs may be undervalued.
- For Japanese and other unsegmented or morphologically rich languages, raw scores are unstable; apply tokenization or subword splitting before scoring.
FAQ #
What is ROUGE-Lsum? #
ROUGE-Lsum (ROUGE Longest Common Subsequence for summarization) computes the LCS score sentence by sentence and then averages the results, rather than treating the entire text as one sequence (as ROUGE-L does). This makes it better suited for multi-sentence or long-document summaries.
Specifically, ROUGE-Lsum:
- Splits the candidate and reference into individual sentences.
- Computes LCS-based matches between each reference sentence and the candidate's sentences (a union LCS in the standard formulation).
- Aggregates the sentence-level match statistics into precision, recall, and F1.
It is the preferred variant for evaluating abstractive summarization models (e.g., BART, T5, GPT-based summarizers) on benchmarks like CNN/DailyMail and XSum.
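The steps above can be sketched as follows. This is a simplification that pairs sentences by position and averages per-sentence LCS F1; library implementations aggregate union-LCS match counts instead, so exact values will differ from library output:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists (DP table)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def lcs_f1(cand, ref):
    """Sentence-level LCS F1: precision vs candidate length, recall vs reference length."""
    cand, ref = cand.split(), ref.split()
    l = lcs_len(cand, ref)
    if l == 0:
        return 0.0
    p, r = l / len(cand), l / len(ref)
    return 2 * p * r / (p + r)

def rouge_lsum_sketch(candidate, reference):
    """Average sentence-level LCS F1; sentences are separated by newlines."""
    pairs = list(zip(candidate.split("\n"), reference.split("\n")))
    return sum(lcs_f1(c, r) for c, r in pairs) / len(pairs)

cand = "the cat sat\nit was happy"
ref = "the cat sat down\nit was very happy"
print(rouge_lsum_sketch(cand, ref))
```

Note the newline convention: most implementations treat `\n` as the sentence boundary, which is why candidate and reference texts must be sentence-split before scoring with ROUGE-Lsum.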
What is the difference between ROUGE-L and ROUGE-Lsum? #
| | ROUGE-L | ROUGE-Lsum |
|---|---|---|
| Granularity | Full text as one sequence | Sentence-by-sentence |
| Sentence boundaries | Ignored | Respected |
| Better for | Short summaries | Long, multi-sentence summaries |
| Common benchmark use | General NLP evaluation | Abstractive summarization |
Use ROUGE-Lsum when evaluating paragraph-level or document-level summaries. Use ROUGE-L for sentence-level generation tasks.
What do ROUGE-1 and ROUGE-2 measure? #
- ROUGE-1: overlap of individual words (unigrams) between candidate and reference. Reflects vocabulary coverage.
- ROUGE-2: overlap of adjacent word pairs (bigrams). Rewards capturing phrases and local word order, making it more sensitive to fluency.
ROUGE-2 is often used as the primary metric in news summarization benchmarks because bigram overlap correlates better with human judgments than unigram overlap alone.
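The order sensitivity is easy to demonstrate: a candidate containing the right words in the wrong order keeps a perfect ROUGE-1 but loses all bigram credit (toy whitespace tokenization for illustration):

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset of n-grams from a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n):
    """Fraction of reference n-grams that also appear in the candidate."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())
    return overlap / sum(ref.values()) if ref else 0.0

reference = "the quick brown fox jumps"
shuffled = "fox brown the jumps quick"   # same words, scrambled order

print(rouge_n_recall(shuffled, reference, 1))  # 1.0: every unigram is present
print(rouge_n_recall(shuffled, reference, 2))  # 0.0: no adjacent pair survives
```

This is why ROUGE-1 alone can be gamed by keyword stuffing, while ROUGE-2 (and ROUGE-L) require some local structure.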
How is ROUGE scored: recall, precision, or F1? #
ROUGE can report all three, but recall is the traditional focus (the name stands for “Recall-Oriented Understudy”). Recall measures how much of the reference content appears in the candidate. In practice, the F1 score (harmonic mean of precision and recall) is now most commonly reported to balance coverage and conciseness.
The rouge_score library returns all three:
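The arithmetic behind those three numbers can be sketched for ROUGE-1 in pure Python (with the `rouge_score` package, the equivalent call is `rouge_scorer.RougeScorer(['rouge1']).score(reference, candidate)`, which returns `Score(precision, recall, fmeasure)` tuples):

```python
from collections import Counter

def rouge1_scores(candidate, reference):
    """ROUGE-1 precision, recall, and F1 from clipped unigram overlap."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)   # matched / candidate length
    recall = overlap / max(sum(ref.values()), 1)       # matched / reference length
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

# A slightly verbose candidate: recall stays perfect, precision drops.
p, r, f = rouge1_scores("the cat sat on the mat today", "the cat sat on the mat")
print(p, r, f)
```

A longer, padded candidate keeps recall at 1.0 while precision falls, and F1 splits the difference — exactly the coverage-versus-conciseness trade-off described above.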
When should I use BERTScore instead of ROUGE? #
ROUGE is purely lexical — it only counts exact word matches. It penalizes valid paraphrases that use different vocabulary. BERTScore uses contextual embeddings to measure semantic similarity, rewarding paraphrases that preserve meaning even with different words.
Use ROUGE as a fast, reproducible baseline for extractive summarization or when comparing systems on standard benchmarks. Use BERTScore (or other embedding-based metrics) when semantic fidelity matters more than exact word overlap, such as for abstractive generation or multilingual evaluation.
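The paraphrase penalty is easy to reproduce with a toy example (exact-match unigram recall only; an embedding-based metric would treat "film"/"movie" and "excellent"/"great" as near-synonyms):

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    """Fraction of reference unigrams found verbatim in the candidate."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    return sum((cand & ref).values()) / sum(ref.values())

reference = "the film was excellent"
paraphrase = "the movie was great"       # same meaning, different content words

print(rouge1_recall(paraphrase, reference))  # 0.5: only "the" and "was" match
```

The paraphrase preserves the meaning yet scores only 0.5, because ROUGE has no notion of synonymy — the failure mode that motivates BERTScore and other embedding-based metrics.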