Score one output, every way

The same candidate and reference, judged by BLEU and ROUGE side by side.

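
As a rough illustration of what the side-by-side comparison computes, the sketch below scores one candidate against one reference with a simplified BLEU and two ROUGE variants. The function names, whitespace tokenisation, the `max_n=2` default, and the example sentences are all assumptions made for this sketch; production scorers typically use up to 4-grams, smoothing, and multiple references.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU: clipped n-gram precision times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # assumption: no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    # The brevity penalty lowers the score of candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: share of reference n-grams that also appear in the candidate."""
    cand_counts = ngrams(candidate.split(), n)
    ref_counts = ngrams(reference.split(), n)
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(c, cand_counts[g]) for g, c in ref_counts.items())
    return overlap / total


def rouge_l_recall(candidate, reference):
    """ROUGE-L recall: longest common subsequence length over reference length."""
    cand, ref = candidate.split(), reference.split()
    if not ref:
        return 0.0
    # Classic dynamic-programming table for the longest common subsequence.
    table = [[0] * (len(ref) + 1) for _ in range(len(cand) + 1)]
    for i, c_tok in enumerate(cand, start=1):
        for j, r_tok in enumerate(ref, start=1):
            if c_tok == r_tok:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1] / len(ref)


# Hypothetical example pair, not taken from the page.
candidate = "the cat sat on the mat"
reference = "the cat is on the mat"

print(f"BLEU    : {bleu(candidate, reference):.3f}")
print(f"ROUGE-1 : {rouge_n_recall(candidate, reference):.3f}")
print(f"ROUGE-L : {rouge_l_recall(candidate, reference):.3f}")
```

BLEU is precision-oriented (how much of the candidate is supported by the reference), while ROUGE is recall-oriented (how much of the reference the candidate recovers), which is why the two can disagree on the same pair.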

All seven metrics at a glance

What each metric scores, what it needs, and where it shines.

| Metric | Family | Measures | Needs a reference? | Range | Best for |
| --- | --- | --- | --- | --- | --- |
| Precision | Classification | Correctness of positive predictions | No (needs labels) | 0 → 1 | Low-false-alarm tasks |
| Recall | Classification | Coverage of real positives | No (needs labels) | 0 → 1 | Low-miss tasks |
| F1 | Classification | Balance of precision and recall | No (needs labels) | 0 → 1 | Imbalanced classes |
| Perplexity | Probabilistic | Model surprise on held-out text | No | 1 → ∞ (lower is better) | Comparing LMs |
| BLEU | Reference overlap | N-gram precision with a brevity penalty | Yes | 0 → 1 | Machine translation |
| ROUGE | Reference overlap | N-gram / LCS recall | Yes | 0 → 1 | Summarisation |
| LLM-as-judge | Model-graded | Rubric quality via an LLM | Optional | Rubric-defined | Open-ended quality |
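
The classification and probabilistic rows can be made concrete with a short sketch. It assumes binary labels with `1` as the positive class and a list of per-token probabilities that a language model assigned to held-out text; the function names and the sample data are hypothetical, chosen only for illustration.

```python
import math


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def perplexity(token_probs):
    """Perplexity: exponential of the mean negative log-likelihood per token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# Hypothetical labels: two true positives, one false alarm, one miss.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
p, r, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")

# A model that gives every token probability 0.25 is as uncertain as a
# uniform choice among four options, so its perplexity is about 4.
print(f"perplexity={perplexity([0.25, 0.25, 0.25, 0.25]):.3f}")
```

Precision, recall and F1 need gold labels rather than a reference text, and perplexity needs only the model's own token probabilities, which matches the "Needs a reference?" column above.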