Score one output, every way

The same candidate and reference, judged by BLEU and ROUGE side by side.
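
To make the side-by-side comparison concrete, here is a minimal sketch of both scores on one sentence pair. It is a simplified illustration rather than a reference implementation: the helper names (`ngrams`, `clipped_precision`, `bleu`, `rouge_n_recall`), the whitespace tokenisation, the single reference, the lack of smoothing, and the `max_n=2` cutoff are all assumptions made for brevity. The example sentences are invented too.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(cand, ref, n):
    """Share of candidate n-grams that also occur in the reference.

    Each n-gram is credited at most as often as it appears in the reference.
    """
    cand_counts = Counter(ngrams(cand, n))
    ref_counts = Counter(ngrams(ref, n))
    overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total else 0.0


def bleu(cand, ref, max_n=2):
    """Simplified BLEU: geometric mean of clipped precisions times a brevity penalty."""
    precisions = [clipped_precision(cand, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # no smoothing in this sketch
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Penalise candidates that are shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_mean)


def rouge_n_recall(cand, ref, n=1):
    """ROUGE-N recall: share of reference n-grams recovered by the candidate."""
    cand_counts = Counter(ngrams(cand, n))
    ref_counts = Counter(ngrams(ref, n))
    overlap = sum(min(c, cand_counts[g]) for g, c in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0


# Hypothetical example pair, tokenised on whitespace.
candidate = "the cat sat on the mat".split()
reference = "the cat is sitting on the mat".split()

print("BLEU (up to bigrams):", round(bleu(candidate, reference), 3))
print("ROUGE-1 recall:", round(rouge_n_recall(candidate, reference), 3))
```

BLEU divides the overlap by the candidate's n-gram count (precision), while ROUGE divides by the reference's (recall), so the same pair can score quite differently under each.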

All seven metrics at a glance

What each metric scores, what it needs, and where it shines.

Metric	Family	Measures	Needs a reference?	Range	Best for
Precision	Classification	Correctness of positive predictions	No (needs labels)	0 → 1	Low-false-alarm tasks
Recall	Classification	Coverage of real positives	No (needs labels)	0 → 1	Low-miss tasks
F1	Classification	Balance of precision & recall	No (needs labels)	0 → 1	Imbalanced classes
Perplexity	Probabilistic	Model surprise on held-out text	No	1 → ∞ (lower is better)	Comparing LMs
BLEU	Reference overlap	N-gram precision with a brevity penalty	Yes	0 → 1	Machine translation
ROUGE	Reference overlap	N-gram / LCS recall	Yes	0 → 1	Summarisation
LLM-as-judge	Model-graded	Rubric quality via an LLM	Optional	rubric-defined	Open-ended quality
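
The rows that need no reference text can be sketched in a few lines. This is a hedged illustration, not a library API: `precision_recall_f1` and `perplexity` are hypothetical helper names, the labels are invented, and the perplexity helper assumes you already have the probability the model assigned to each held-out token.

```python
import math


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one positive class from gold labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # correct alarms
    fp = sum(t != positive and p == positive for t, p in pairs)  # false alarms
    fn = sum(t == positive and p != positive for t, p in pairs)  # misses
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def perplexity(token_probs):
    """Exponentiated average negative log-probability of the observed tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# Hypothetical gold labels and predictions (1 = positive class).
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]
p, r, f1 = precision_recall_f1(y_true, y_pred)
print("precision, recall, F1:", round(p, 3), round(r, 3), round(f1, 3))

# A model that gives every token probability 0.25 is as unsure as a
# uniform choice among four options, so its perplexity is about 4.
print("perplexity:", round(perplexity([0.25, 0.25, 0.25, 0.25]), 3))
```

A perplexity of 1 would mean the model assigned probability 1 to every token, which is why the range in the table starts at 1.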