Summary
- Summarise the main evaluation metrics for language models and when to use them.
- Highlight automatic metrics for translation, summarisation, and generation tasks with code snippets.
- Explain how to combine automated scores with human or LLM-based judgments while controlling bias.
Overview of language-model metrics #
Language tasks rarely have a single “gold” answer, so we rely on multiple metric families. We group them into n-gram-based, embedding-based, and LLM-as-a-judge approaches. Each family captures a different aspect of quality: surface overlap, semantic similarity, and instruction compliance.
1. N-gram based metrics (surface overlap) #
| Metric | What it measures | Characteristics |
|---|---|---|
| BLEU | n-gram precision with brevity penalty | Machine translation staple; struggles with short outputs and paraphrases |
| ROUGE-1/2/L | n-gram recall / longest common subsequence | Widely used for summarisation; benefits from multiple references |
| METEOR | Unigram matches plus synonym/inflection handling | Recall-oriented; strong English tooling via WordNet |
| chrF++ | Character n-gram F-score | Robust for morphologically rich languages; extends chrF with word n-grams |
Usage notes #
- BLEU: compare against established benchmarks; apply smoothing for short references.
- ROUGE: useful for extractive summaries; the `rouge-score` package offers easy computation (see the snippet below).
- METEOR: captures semantic closeness but requires lexical resources such as WordNet.
- chrF++: handy when tokenisation is hard (Japanese, Chinese, etc.).
```python
import sacrebleu

# sacrebleu expects a list of reference streams, each aligned with the candidates.
references = [["今日は良い天気です。"]]
candidate = "今日はとてもいい天気だ。"
# Corpus-level BLEU and chrF (chrF is character-based, so it copes better with Japanese).
bleu_score = sacrebleu.corpus_bleu([candidate], references)
chrf_score = sacrebleu.corpus_chrf([candidate], references)
print(bleu_score.score, chrf_score.score)
```
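
The ROUGE bullet above mentions the `rouge-score` package; here is a minimal sketch of its API (English example; the default tokenizer is whitespace/ASCII-oriented, so Japanese text would need a custom tokenizer, not shown here).

```python
from rouge_score import rouge_scorer

# ROUGE-1/2/L with stemming enabled.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "The weather is nice today.",       # reference
    "Today the weather is very nice.",  # candidate
)
print(scores["rougeL"].fmeasure)
```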
2. Embedding-based metrics (semantic similarity) #
| Metric | Model | Notes |
|---|---|---|
| BERTScore | BERT/RoBERTa token embeddings | Computes precision/recall/F1 in embedding space; multilingual support |
| MoverScore | Word Mover’s Distance variant | Gives more weight to rare words; higher computational cost |
| BLEURT | BERT fine-tuned on synthetic data and human ratings | High correlation even with small datasets; requires downloading checkpoints |
| COMET | Multilingual Transformer trained on human quality assessments (DA/MQM) | Strong MT benchmark performance; CLI and API available |
| QAEval / ParaScore | Question generation + answer consistency | Evaluates semantic faithfulness via QA; setup overhead is higher |
Usage notes #
- The `bert-score` library makes BERTScore easy to run, including with Japanese models.
- BLEURT/COMET need pre-trained checkpoints; they offer the best human correlation but cost more to run (see the COMET sketch below).
- Choose BERTScore for a lightweight semantic check; upgrade to BLEURT/COMET when reliability is critical.
```python
from bert_score import score

cands = ["今日はとてもいい天気だ。"]
refs = ["今日は良い天気です。"]
# Token-level similarity in embedding space, aggregated into precision/recall/F1.
P, R, F1 = score(cands, refs, lang="ja", model_type="cl-tohoku/bert-base-japanese")
print(F1.mean().item())
```
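
The notes above flag that COMET needs a downloaded checkpoint. A minimal sketch, assuming the `unbabel-comet` package (2.x) and the `Unbabel/wmt22-comet-da` checkpoint; verify both names against the current release.

```python
from comet import download_model, load_from_checkpoint

# Download and load a reference-based COMET checkpoint (cached after the first run).
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [{
    "src": "It is nice weather today.",   # source sentence
    "mt": "今日はとてもいい天気だ。",      # system output
    "ref": "今日は良い天気です。",          # reference translation
}]
output = model.predict(data, batch_size=8, gpus=0)  # gpus=0 runs on CPU
print(output.system_score)
```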
3. LLM-as-a-judge #
| Approach | Summary | Watch-outs |
|---|---|---|
| Direct grading by LLM | Provide reference + candidate to a model like GPT-4 | Prompt design and bias control are crucial |
| Rubric-guided scoring | Supply detailed criteria (fluency, faithfulness, toxicity) per dimension | Higher cost; better analytic insight |
| LLM + QA consistency | Generate questions and check answers for alignment | Useful for long-form summarisation; quality depends on generated QA pairs |
Guidelines #
- Publish prompts and system messages to ensure transparency.
- Fix temperature/randomness and average multiple runs for stability.
- Periodically validate against human ratings to maintain trust.
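
Putting these guidelines into practice, here is a minimal rubric-guided judging sketch with the OpenAI Python client; the model name and rubric are illustrative assumptions, not a recommendation.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge(reference: str, candidate: str, model: str = "gpt-4o-mini") -> str:
    # The model name is a placeholder; publish this prompt alongside your results.
    prompt = (
        "Rate the candidate against the reference on a 1-5 scale for "
        "fluency, faithfulness, and coverage. Reply as JSON like "
        '{"fluency": 4, "faithfulness": 5, "coverage": 3}.\n\n'
        f"Reference: {reference}\nCandidate: {candidate}"
    )
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # fixed randomness, as the guidelines above suggest
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(judge("It is nice weather today.", "The weather is very nice today."))
```

Averaging several such calls (or several judge models) and spot-checking against human ratings keeps the scores trustworthy.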
Choosing metrics: a decision flow #
- Clarify the task: translation/summarisation → include n-gram metrics; open-ended generation → rely more on embeddings + LLM judges.
- Check reference availability: plenty of references → combine n-gram + embedding metrics; few references → lean on LLM or human evaluation.
- Balance cost vs. fidelity: need fast iteration → SacreBLEU or BERTScore; high-stakes evaluation → BLEURT/COMET or GPT-based judging.
Checklist #
- Reference data quality/quantity assessed
- Automated metrics validated against human or LLM judgments
- Surface overlap and semantic alignment both considered
- Prompts/configurations for LLM evaluators documented
- Bias detection and mitigation plan in place
Further reading #
- BLEU / chrF++: SacreBLEU
- ROUGE: rouge-score
- BERTScore: bert-score GitHub
- BLEURT: Google Research BLEURT
- COMET: Unbabel COMET
- LLM-as-a-Judge: OpenAI evals
Combine multiple metrics, check correlation with human ratings, and iterate on the evaluation pipeline as your model or task evolves.
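
Checking that correlation can be as simple as a few lines of SciPy; the rating and metric lists below are placeholders for per-example scores you would collect yourself.

```python
from scipy.stats import pearsonr, spearmanr

human_ratings = [4.0, 2.5, 5.0, 3.0, 4.5]       # e.g. averaged annotator scores (placeholder)
metric_scores = [0.82, 0.55, 0.91, 0.60, 0.88]  # e.g. BERTScore F1 per example (placeholder)

print("Pearson:", pearsonr(human_ratings, metric_scores)[0])
print("Spearman:", spearmanr(human_ratings, metric_scores)[0])
```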