Language Model Evaluation

Summary
  • Summarise the main evaluation metrics for language models and when to use them.
  • Highlight automatic metrics for translation, summarisation, and generation tasks with code snippets.
  • Explain how to combine automated scores with human or LLM-based judgments while controlling bias.

Overview of language-model metrics #

Language tasks rarely have a single “gold” answer, so we rely on multiple metric families. We group them into n-gram based, embedding based, and LLM-as-a-judge approaches. Each family captures different aspects of quality—surface overlap, semantic similarity, and instruction compliance.


1. N-gram based metrics (surface overlap) #

| Metric | What it measures | Characteristics |
| --- | --- | --- |
| BLEU | n-gram precision with a brevity penalty | Machine translation staple; struggles with short outputs and paraphrases |
| ROUGE-1/2/L | n-gram recall / longest common subsequence | Widely used for summarisation; benefits from multiple references |
| METEOR | Unigram matches plus synonym/inflection handling | Recall-oriented; strong English tooling via WordNet |
| chrF++ | Character n-gram F-score | Robust for morphologically rich languages; extends chrF with word n-grams |

Usage notes #

  • BLEU: compare against established benchmarks; apply smoothing when scoring short segments or individual sentences.
  • ROUGE: useful for extractive summaries; the rouge-score package offers easy computation (see the sketch after the BLEU/chrF++ example below).
  • METEOR: captures semantic closeness but requires lexical resources.
  • chrF++: handy when tokenisation is hard (Japanese, Chinese, etc.).

import sacrebleu

# One hypothesis and one reference stream: sacrebleu expects references
# as a list of reference lists aligned with the list of hypotheses.
references = [["今日は良い天気です。"]]
candidate = "今日はとてもいい天気だ。"

# Corpus-level BLEU (the default tokenizer is not ideal for Japanese;
# sacrebleu also ships a "ja-mecab" tokenizer when MeCab is installed).
bleu_score = sacrebleu.corpus_bleu([candidate], references)
# word_order=2 upgrades plain chrF to chrF++ (character + word n-grams).
chrf_score = sacrebleu.corpus_chrf([candidate], references, word_order=2)

print(bleu_score.score, chrf_score.score)
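
The rouge-score package mentioned above computes ROUGE-1/2/L directly; a minimal sketch with English toy strings, since the package's default tokeniser is not suited to Japanese:

from rouge_score import rouge_scorer

# ROUGE-1/2/L between a reference and a candidate summary (toy strings).
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "the cat sat on the mat",        # reference
    "a cat was sitting on the mat",  # candidate
)
print(scores["rougeL"].fmeasure)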

2. Embedding-based metrics (semantic similarity) #

| Metric | Model | Notes |
| --- | --- | --- |
| BERTScore | BERT/RoBERTa token embeddings | Computes precision/recall/F1 in embedding space; multilingual support |
| MoverScore | Word Mover's Distance variant | Gives more weight to rare words; higher computational cost |
| BLEURT | RoBERTa fine-tuned on human ratings | High correlation even with small datasets; requires downloading checkpoints |
| COMET | Multilingual encoder fine-tuned on human quality judgments | Strong MT benchmark performance; CLI and API available |
| QAEval / ParaScore | Question generation + answer consistency | Evaluates semantic faithfulness via QA; setup overhead is higher |

Usage notes #

  • The bert-score library makes BERTScore easy to run, including with Japanese models.
  • BLEURT/COMET need pre-trained checkpoints; they offer the best human correlation but cost more to run (a COMET sketch follows the BERTScore example below).
  • Choose BERTScore for a lightweight semantic check; upgrade to BLEURT/COMET when reliability is critical.

from bert_score import score

cands = ["今日はとてもいい天気だ。"]
refs = ["今日は良い天気です。"]

# BERTScore matches token embeddings between candidate and reference;
# model_type overrides the default checkpoint that lang="ja" would pick.
P, R, F1 = score(cands, refs, lang="ja", model_type="cl-tohoku/bert-base-japanese")
print(F1.mean().item())
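
For the checkpoint-based metrics, the unbabel-comet package exposes roughly the API sketched below. The checkpoint name "Unbabel/wmt22-comet-da" and the src/mt/ref field layout are assumptions based on its common usage; verify them against the version you install.

from comet import download_model, load_from_checkpoint

# Download a reference-based COMET checkpoint (assumed name) and score one triple.
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [{
    "src": "It is nice weather today.",  # source sentence
    "mt": "今日はとてもいい天気だ。",       # system output
    "ref": "今日は良い天気です。",          # human reference
}]
output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)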

3. LLM-as-a-judge #

| Approach | Summary | Watch-outs |
| --- | --- | --- |
| Direct grading by LLM | Provide reference + candidate to a model such as GPT-4 | Prompt design and bias control are crucial |
| Rubric-guided scoring | Supply detailed criteria (fluency, faithfulness, toxicity) per dimension | Higher cost; better analytic insight |
| LLM + QA consistency | Generate questions and check answers for alignment | Useful for long-form summarisation; quality depends on generated QA pairs |

Guidelines #

  • Publish prompts and system messages to ensure transparency.
  • Fix temperature/randomness and average multiple runs for stability (see the sketch below).
  • Periodically validate against human ratings to maintain trust.
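
A minimal sketch of rubric-guided direct grading with the OpenAI Python client follows; the model name, the 1-5 rubric, and the three-run averaging are illustrative assumptions, not a prescribed setup.

from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "Rate the candidate against the reference on a 1-5 scale for "
    "fluency and faithfulness. Reply with a single integer only."
)

def judge(reference: str, candidate: str, runs: int = 3) -> float:
    """Average several fixed-temperature runs for stability."""
    scores = []
    for _ in range(runs):
        resp = client.chat.completions.create(
            model="gpt-4o",   # assumed model name
            temperature=0,    # fix randomness, as recommended above
            messages=[
                {"role": "system", "content": RUBRIC},
                {"role": "user", "content": f"Reference: {reference}\nCandidate: {candidate}"},
            ],
        )
        scores.append(float(resp.choices[0].message.content.strip()))
    return sum(scores) / len(scores)

print(judge("今日は良い天気です。", "今日はとてもいい天気だ。"))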

Choosing metrics: a decision flow #

  1. Clarify the task
    Translation/summary → include n-gram metrics; open-ended generation → rely more on embeddings + LLM judges.
  2. Check reference availability
    Plenty of references → combine n-gram + embedding metrics; few references → lean on LLM or human evaluation.
  3. Balance cost vs. fidelity
    Need fast iteration → SacreBLEU or BERTScore; high-stakes evaluation → BLEURT/COMET or GPT-based judging (a toy routing helper is sketched after this list).
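
As an illustration only, the flow above can be encoded as a small routing helper; the task labels and metric names are assumptions for the sketch, not a fixed API.

def pick_metrics(task: str, num_references: int, high_stakes: bool) -> list[str]:
    """Toy version of the decision flow above; adapt to your own pipeline."""
    metrics: list[str] = []
    if task in {"translation", "summarisation"} and num_references > 0:
        metrics += ["sacrebleu", "chrf++", "rouge"]   # surface overlap
    if num_references > 0:
        metrics.append("bertscore")                   # lightweight semantic check
    else:
        metrics.append("llm-judge")                   # nothing to overlap against
    if high_stakes:
        metrics += ["comet" if task == "translation" else "bleurt", "llm-judge"]
    return list(dict.fromkeys(metrics))               # drop duplicates, keep order

print(pick_metrics("translation", num_references=1, high_stakes=True))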

Checklist #

  • Reference data quality/quantity assessed
  • Automated metrics validated against human or LLM judgments
  • Surface overlap and semantic alignment both considered
  • Prompts/configurations for LLM evaluators documented
  • Bias detection and mitigation plan in place

Further reading #

Combine multiple metrics, check correlation with human ratings, and iterate on the evaluation pipeline as your model or task evolves.
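
To check correlation with human ratings, a common sanity check is Pearson/Spearman correlation between the automated scores and the human scores; a minimal sketch with scipy, using placeholder numbers:

from scipy.stats import pearsonr, spearmanr

# Placeholder arrays: one automated metric score and one human rating per sample.
metric_scores = [62.1, 48.3, 71.0, 55.4, 80.2]
human_ratings = [3.5, 2.0, 4.0, 3.0, 4.5]

pearson_r, _ = pearsonr(metric_scores, human_ratings)
spearman_r, _ = spearmanr(metric_scores, human_ratings)
print(pearson_r, spearman_r)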