Language Model Evaluation

Summary
  • Summarise the main evaluation metrics for language models and when to use them.
  • Highlight automatic metrics for translation, summarisation, and generation tasks with code snippets.
  • Explain how to combine automated scores with human or LLM-based judgments while controlling bias.

Overview of language-model metrics #

Language tasks rarely have a single “gold” answer, so we rely on multiple metric families. We group them into n-gram based, embedding based, and LLM-as-a-judge approaches. Each family captures different aspects of quality—surface overlap, semantic similarity, and instruction compliance.


1. N-gram based metrics (surface overlap) #

| Metric | What it measures | Characteristics |
| --- | --- | --- |
| BLEU | n-gram precision with a brevity penalty | Machine translation staple; struggles with short outputs and paraphrases |
| ROUGE-1/2/L | n-gram recall / longest common subsequence | Widely used for summarisation; benefits from multiple references |
| METEOR | Unigram matches plus synonym/inflection handling | Recall-oriented; strong English tooling via WordNet |
| chrF++ | Character n-gram F-score | Robust for morphologically rich languages; extends chrF with word n-grams |

Usage notes #

  • BLEU: compare against established benchmarks; apply smoothing when scoring short segments or individual sentences.
  • ROUGE: useful for extractive summaries; the rouge-score package offers easy computation (see the sketch after the BLEU/chrF++ example below).
  • METEOR: captures semantic closeness but requires lexical resources.
  • chrF++: handy when tokenisation is hard (Japanese, Chinese, etc.).

import sacrebleu

# One hypothesis and one reference stream: sacrebleu expects references
# as a list of reference lists aligned with the list of hypotheses.
references = [["今日は良い天気です。"]]
candidate = "今日はとてもいい天気だ。"

# Corpus-level BLEU (the default tokenizer is not ideal for Japanese;
# sacrebleu also ships a "ja-mecab" tokenizer when MeCab is installed).
bleu_score = sacrebleu.corpus_bleu([candidate], references)
# word_order=2 upgrades plain chrF to chrF++ (character + word n-grams).
chrf_score = sacrebleu.corpus_chrf([candidate], references, word_order=2)

print(bleu_score.score, chrf_score.score)
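
The rouge-score package mentioned above computes ROUGE-1/2/L directly; a minimal sketch with English toy strings, since the package's default tokeniser is not suited to Japanese:

from rouge_score import rouge_scorer

# ROUGE-1/2/L between a reference and a candidate summary (toy strings).
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "the cat sat on the mat",        # reference
    "a cat was sitting on the mat",  # candidate
)
print(scores["rougeL"].fmeasure)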

2. Embedding-based metrics (semantic similarity) #

| Metric | Model | Notes |
| --- | --- | --- |
| BERTScore | BERT/RoBERTa token embeddings | Computes precision/recall/F1 in embedding space; multilingual support |
| MoverScore | Word Mover's Distance variant | Gives more weight to rare words; higher computational cost |
| BLEURT | RoBERTa fine-tuned on human ratings | High correlation even with small datasets; requires downloading checkpoints |
| COMET | Multilingual encoder fine-tuned on human quality judgments | Strong MT benchmark performance; CLI and API available |
| QAEval / ParaScore | Question generation + answer consistency | Evaluates semantic faithfulness via QA; setup overhead is higher |

Usage notes #

  • The bert-score library makes BERTScore easy to run, including with Japanese models.
  • BLEURT/COMET need pre-trained checkpoints; they offer the best human correlation but cost more to run (a COMET sketch follows the BERTScore example below).
  • Choose BERTScore for a lightweight semantic check; upgrade to BLEURT/COMET when reliability is critical.

from bert_score import score

cands = ["今日はとてもいい天気だ。"]
refs = ["今日は良い天気です。"]

# BERTScore matches token embeddings between candidate and reference;
# model_type overrides the default checkpoint that lang="ja" would pick.
P, R, F1 = score(cands, refs, lang="ja", model_type="cl-tohoku/bert-base-japanese")
print(F1.mean().item())
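
For the checkpoint-based metrics, the unbabel-comet package exposes roughly the API sketched below. The checkpoint name "Unbabel/wmt22-comet-da" and the src/mt/ref field layout are assumptions based on its common usage; verify them against the version you install.

from comet import download_model, load_from_checkpoint

# Download a reference-based COMET checkpoint (assumed name) and score one triple.
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [{
    "src": "It is nice weather today.",  # source sentence
    "mt": "今日はとてもいい天気だ。",       # system output
    "ref": "今日は良い天気です。",          # human reference
}]
output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)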

3. LLM-as-a-judge #

| Approach | Summary | Watch-outs |
| --- | --- | --- |
| Direct grading by LLM | Provide reference + candidate to a model such as GPT-4 | Prompt design and bias control are crucial |
| Rubric-guided scoring | Supply detailed criteria (fluency, faithfulness, toxicity) per dimension | Higher cost; better analytic insight |
| LLM + QA consistency | Generate questions and check answers for alignment | Useful for long-form summarisation; quality depends on generated QA pairs |

Guidelines #

  • Publish prompts and system messages to ensure transparency.
  • Fix temperature/randomness and average multiple runs for stability (see the sketch below).
  • Periodically validate against human ratings to maintain trust.
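
A minimal sketch of rubric-guided direct grading with the OpenAI Python client follows; the model name, the 1-5 rubric, and the three-run averaging are illustrative assumptions, not a prescribed setup.

from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "Rate the candidate against the reference on a 1-5 scale for "
    "fluency and faithfulness. Reply with a single integer only."
)

def judge(reference: str, candidate: str, runs: int = 3) -> float:
    """Average several fixed-temperature runs for stability."""
    scores = []
    for _ in range(runs):
        resp = client.chat.completions.create(
            model="gpt-4o",   # assumed model name
            temperature=0,    # fix randomness, as recommended above
            messages=[
                {"role": "system", "content": RUBRIC},
                {"role": "user", "content": f"Reference: {reference}\nCandidate: {candidate}"},
            ],
        )
        scores.append(float(resp.choices[0].message.content.strip()))
    return sum(scores) / len(scores)

print(judge("今日は良い天気です。", "今日はとてもいい天気だ。"))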

Choosing metrics: a decision flow #

  1. Clarify the task
    Translation/summary → include n-gram metrics; open-ended generation → rely more on embeddings + LLM judges.
  2. Check reference availability
    Plenty of references → combine n-gram + embedding metrics; few references → lean on LLM or human evaluation.
  3. Balance cost vs. fidelity
    Need fast iteration → SacreBLEU or BERTScore; high-stakes evaluation → BLEURT/COMET or GPT-based judging (a toy routing helper is sketched after this list).
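
As an illustration only, the flow above can be encoded as a small routing helper; the task labels and metric names are assumptions for the sketch, not a fixed API.

def pick_metrics(task: str, num_references: int, high_stakes: bool) -> list[str]:
    """Toy version of the decision flow above; adapt to your own pipeline."""
    metrics: list[str] = []
    if task in {"translation", "summarisation"} and num_references > 0:
        metrics += ["sacrebleu", "chrf++", "rouge"]   # surface overlap
    if num_references > 0:
        metrics.append("bertscore")                   # lightweight semantic check
    else:
        metrics.append("llm-judge")                   # nothing to overlap against
    if high_stakes:
        metrics += ["comet" if task == "translation" else "bleurt", "llm-judge"]
    return list(dict.fromkeys(metrics))               # drop duplicates, keep order

print(pick_metrics("translation", num_references=1, high_stakes=True))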

Checklist #

  • Reference data quality/quantity assessed
  • Automated metrics validated against human or LLM judgments
  • Surface overlap and semantic alignment both considered
  • Prompts/configurations for LLM evaluators documented
  • Bias detection and mitigation plan in place

Further reading #

Combine multiple metrics, check correlation with human ratings, and iterate on the evaluation pipeline as your model or task evolves.
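
To check correlation with human ratings, a common sanity check is Pearson/Spearman correlation between the automated scores and the human scores; a minimal sketch with scipy, using placeholder numbers:

from scipy.stats import pearsonr, spearmanr

# Placeholder arrays: one automated metric score and one human rating per sample.
metric_scores = [62.1, 48.3, 71.0, 55.4, 80.2]
human_ratings = [3.5, 2.0, 4.0, 3.0, 4.5]

pearson_r, _ = pearsonr(metric_scores, human_ratings)
spearman_r, _ = spearmanr(metric_scores, human_ratings)
print(pearson_r, spearman_r)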