The same candidate and reference, judged by BLEU and ROUGE side by side.

What each metric scores, what it needs, and where it shines.

| Metric | Family | Measures | Needs a reference? | Range | Best for |
|---|---|---|---|---|---|
| Precision | Classification | Correctness of positive predictions | No (needs labels) | 0 → 1 | Low-false-alarm tasks |
| Recall | Classification | Coverage of real positives | No (needs labels) | 0 → 1 | Low-miss tasks |
| F1 | Classification | Balance of precision & recall | No (needs labels) | 0 → 1 | Imbalanced classes |
| Perplexity | Probabilistic | Model surprise on held-out text | No | 1 → ∞ (lower better) | Comparing LMs |
| BLEU | Reference overlap | Modified n-gram precision × brevity penalty | Yes | 0 → 1 | Machine translation |
| ROUGE | Reference overlap | N-gram / LCS recall | Yes | 0 → 1 | Summarisation |
| LLM-as-judge | Model-graded | Rubric quality via an LLM | Optional | Rubric-defined | Open-ended quality |
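
To make the table's "Measures" column concrete, here is a minimal Python sketch of the ideas behind several rows. It is a simplification under stated assumptions, not a reference implementation: the function names are illustrative, tokenisation is plain whitespace splitting, the BLEU-style score uses unigrams only (full BLEU takes a geometric mean over 1- to 4-gram precisions), and the ROUGE-style score is n-gram recall only, with no LCS variant.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n=1):
    """BLEU-style modified precision: share of candidate n-grams found in
    the reference, with each n-gram's credit clipped to its reference count."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0


def brevity_penalty(candidate, reference):
    """BLEU brevity penalty: 1 if the candidate is longer than the
    reference, otherwise exp(1 - r/c), which shrinks as the candidate shortens."""
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1 - r / c)


def ngram_recall(candidate, reference, n=1):
    """ROUGE-N-style recall: share of reference n-grams the candidate covers."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0


def precision_recall_f1(y_true, y_pred):
    """Binary precision, recall, and F1 from 0/1 labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def perplexity(token_probs):
    """Perplexity: exp of the average negative log-probability per token."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))


# Hypothetical example: a short candidate that copies half the reference.
reference = "the cat sat on the mat".split()
candidate = "the cat sat".split()

bleu1_precision = clipped_precision(candidate, reference)  # 1.0
bp = brevity_penalty(candidate, reference)                 # exp(-1)
bleu1 = bp * bleu1_precision
rouge1_recall = ngram_recall(candidate, reference)         # 0.5

print(f"BLEU-1 style: {bleu1:.3f}  ROUGE-1 recall: {rouge1_recall:.3f}")
```

The example shows why the two overlap metrics can disagree on the same pair: every candidate word appears in the reference, so precision is perfect and only the brevity penalty lowers the BLEU-style score, while the recall-oriented ROUGE-style score drops because half the reference is missing.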