Why Evaluate?

A model is only as trustworthy as the number you use to judge it. Evaluation turns "this output looks good" into a measurable score you can compare, track and optimise. But there is no single right metric — the question you're asking determines which one is meaningful.

Broadly, the metrics on this site fall into four families, each suited to a different kind of task:

Four Families of Metrics

Open the Metrics page to compute each one interactively.

Classification: label vs label

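When the output is a discrete label, score it against the truth with Precision, Recall and their harmonic mean, F1. Precision is the share of predicted positives that are correct; recall is the share of actual positives that were found. All three are derived from the confusion matrix of true/false positives and negatives.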
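As a minimal sketch of how these three scores fall out of the confusion matrix, the snippet below counts true positives, false positives and false negatives for a binary task. The function name `precision_recall_f1`, the `positive` parameter and the choice to return 0.0 when a denominator is zero are assumptions made for illustration, not something this site prescribes.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Binary precision, recall and F1 from two equal-length label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # hits
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false alarms
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # misses

    # Assumption: report 0.0 instead of raising when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
p, r, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Here the classifier finds 2 of the 4 positives and raises 1 false alarm, so precision (2/3) is higher than recall (1/2), and F1 sits between the two.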

Probabilistic: how surprised?

Perplexity measures how well a language model predicts held-out text. It is the exponential of the average negative log-likelihood per token, so lower means the model was less surprised. It's the standard intrinsic metric for language modelling.
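To make "less surprised" concrete, here is a small sketch that computes perplexity as the exponential of the average negative log-probability per token. It assumes you already have the probability the model assigned to each actual token of the held-out text; the function name `perplexity` and the input format are illustrative choices.

```python
import math


def perplexity(token_probs):
    """Perplexity from the probabilities a model gave to each observed token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood per token (natural log).
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


confident = [0.9, 0.8, 0.95, 0.85]  # model usually expected the next token
unsure = [0.25, 0.25, 0.25, 0.25]   # like guessing among 4 equally likely tokens

print(perplexity(confident))
print(perplexity(unsure))  # roughly 4
```

A perplexity of about k can be read as the model being as uncertain as if it were choosing uniformly among k tokens at each step.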

Reference overlap: text vs text

For generated text, compare against a human reference by counting shared n-grams. BLEU (precision-oriented, for translation) and ROUGE (recall-oriented, for summarisation) are the classics.
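The sketch below illustrates the shared n-gram idea behind both metrics. It is deliberately simplified: it computes clipped n-gram precision (the core of BLEU) and n-gram recall (the core of ROUGE-N) for a single reference, and leaves out BLEU's brevity penalty and its averaging over several n-gram orders. Whitespace tokenisation and the names `ngram_counts` and `ngram_overlap` are assumptions for illustration.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap(candidate, reference, n=1):
    """Return (precision, recall) of n-gram overlap for one reference."""
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    # Counter intersection keeps the minimum count per n-gram, so a word
    # repeated in the candidate cannot be credited more often than it
    # appears in the reference (clipping).
    matched = sum((cand & ref).values())
    precision = matched / sum(cand.values()) if cand else 0.0  # BLEU-style
    recall = matched / sum(ref.values()) if ref else 0.0       # ROUGE-N-style
    return precision, recall


reference = "the cat sat on the mat"
candidate = "the cat sat"
print(ngram_overlap(candidate, reference, n=1))
print(ngram_overlap(candidate, reference, n=2))
```

The short candidate scores perfect precision because every n-gram it contains is in the reference, but low recall because it covers only part of the reference. That gap is why the precision-oriented and recall-oriented views suit different tasks.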

Model-graded: judge by LLM

LLM-as-judge uses a strong model to score or rank outputs against a rubric. It is flexible and often correlates well with human judgement, but it brings its own biases to watch for, such as favouring longer answers or whichever answer is shown first.
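One commonly reported judge bias is position bias, where the judge favours whichever answer it sees first. A simple guard is to ask twice with the order swapped and only accept a verdict that survives the swap. The sketch below shows that pattern. The `judge` callable and its `(prompt, answer_a, answer_b) -> "A" or "B"` signature are hypothetical stand-ins for a real model call, and the two toy judges exist only so the example runs without any API.

```python
from typing import Callable

# Hypothetical judge interface: returns "A" or "B" for the preferred answer.
Judge = Callable[[str, str, str], str]


def pairwise_verdict(judge: Judge, prompt: str, first: str, second: str) -> str:
    """Compare two answers in both orders; return 'first', 'second' or 'tie'."""
    first_wins_as_a = judge(prompt, first, second) == "A"  # `first` shown as A
    first_wins_as_b = judge(prompt, second, first) == "B"  # `first` shown as B
    if first_wins_as_a and first_wins_as_b:
        return "first"
    if not first_wins_as_a and not first_wins_as_b:
        return "second"
    return "tie"  # verdict flipped with the order, so do not trust it


def longer_is_better(prompt: str, a: str, b: str) -> str:
    """Toy stand-in judge: prefers the longer answer."""
    return "A" if len(a) >= len(b) else "B"


def always_a(prompt: str, a: str, b: str) -> str:
    """Toy stand-in judge with pure position bias."""
    return "A"


print(pairwise_verdict(longer_is_better, "q", "a detailed answer", "short"))
print(pairwise_verdict(always_a, "q", "a detailed answer", "short"))
```

A judge whose preference depends only on position ends up with a tie under this scheme, while a judge with a consistent preference produces the same winner in both orders. Note that the swap does nothing for other biases: the length-preferring toy judge still wins consistently.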

Choosing the Right One

Use precision/recall/F1 when outputs are labels and you care about the balance between false alarms and misses. Use perplexity to compare language models intrinsically. Use BLEU/ROUGE for generation when you have references — BLEU when precision matters (translation), ROUGE when coverage matters (summarisation). Reach for LLM-as-judge when quality is open-ended and no single reference captures it.

Head to Metrics to build each score step by step, or Compare to score one generated sentence across BLEU and ROUGE at once.