ROUGE Metrics

Summary
  • ROUGE is a family of metrics that measure n-gram or LCS overlap between generated and reference texts.
  • Using a summarization example, we calculate ROUGE-1/2/L and interpret their meaning.
  • We also discuss its correlation with human judgment and practical issues in long-text evaluation.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a family of metrics that measure the degree of overlap between a candidate summary and a reference summary.
ROUGE-1 and ROUGE-2 compute n-gram recall, while ROUGE-L is based on the Longest Common Subsequence (LCS) between the two texts.
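
As a quick illustration of the recall side, here is a minimal sketch (not the library implementation) that counts clipped n-gram overlaps and divides by the number of reference n-grams; the two sentences are hypothetical examples.

from collections import Counter

def rouge_n_recall(candidate_tokens, reference_tokens, n=1):
    # Collect n-grams as multisets so repeated n-grams are counted correctly.
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate_tokens), ngrams(reference_tokens)
    # Clipped overlap: a reference n-gram is matched at most as often as it appears.
    overlap = sum((cand & ref).values())
    return overlap / max(sum(ref.values()), 1)

# Hypothetical example: 4 of the 5 reference unigrams appear in the candidate.
print(rouge_n_recall("the weather is nice".split(),
                     "today the weather is nice".split()))  # 0.8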


Main Variants #

  • ROUGE-1 / ROUGE-2: Measure unigram and bigram recall — useful for extractive summarization and checking key-term coverage.
  • ROUGE-L: Based on the length of the longest common subsequence normalized by reference length — considers word order.
  • ROUGE-Lsum: Averages LCS scores at the sentence level — often used for long-document summarization.
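
For example, the rouge_score package can compute several of these variants in one call: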
from rouge_score import rouge_scorer

# Request several ROUGE variants at once; use_stemmer normalizes word endings.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeLsum"], use_stemmer=True)
candidate = "Today is a very good weather day"
reference = "Today the weather is nice"
# score() takes the reference (target) first, then the candidate (prediction).
print(scorer.score(reference, candidate))
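
The call returns a dictionary keyed by metric name, and each value is a Score tuple exposing precision, recall, and fmeasure, so the recall-oriented numbers discussed above can be read off directly.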

Advantages #

  • Lightweight to compute; multiple sub-metrics can be calculated at once.
  • Shows relatively strong correlation with human recall-based evaluation.
  • Standard metric for extractive summarization and information-content comparison.

Limitations #

  • Does not account for semantic similarity, so paraphrases can unfairly reduce the score (see the sketch after this list).
  • With only one reference summary, diversity in candidate outputs may be undervalued.
  • Scores for Japanese and other morphologically rich languages are unstable unless tokenization or subword splitting is applied first.
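
To make the paraphrase issue concrete, here is a small sketch (the sentences are hypothetical): a semantically equivalent rewording shares almost no n-grams with the reference, so it scores far lower than a partial verbatim copy.

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)
reference = "The weather is nice today"
paraphrase = "It is a pleasant day outside"  # same meaning, different wording
verbatim = "The weather is nice"             # partial copy of the reference

# The paraphrase overlaps on almost nothing, while the partial copy scores highly.
print(scorer.score(reference, paraphrase))
print(scorer.score(reference, verbatim))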