Every classification metric starts from four counts. Edit them and watch the scores move.
A model is only as trustworthy as the number you use to judge it. Evaluation turns "this output looks good" into a measurable score you can compare, track and optimise. But there is no single right metric — the question you're asking determines which one is meaningful.
Broadly, the metrics on this site fall into four families, each suited to a different kind of task. Open the Metrics page to compute each one interactively.
The four families are:
When the output is a discrete label, score it against the truth with Precision, Recall and their harmonic mean F1 — all derived from the confusion matrix of true/false positives and negatives.
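As a rough illustration of how the three scores fall out of the confusion-matrix counts, here is a minimal Python sketch. The function name `precision_recall_f1` and the example counts are made up for illustration and are not taken from this site's code. True negatives are not needed for these three scores.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall and F1 from confusion-matrix counts.

    tp: true positives, fp: false positives, fn: false negatives.
    Returns 0.0 for any score whose denominator is zero (a common convention,
    assumed here rather than taken from the source).
    """
    # Precision: of everything predicted positive, how much was correct?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything actually positive, how much was found?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical counts: 8 hits, 2 false alarms, 4 misses.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

With these counts, precision is 8/10 and recall is 8/12, so the model raises few false alarms but misses a third of the true positives; F1 sits between the two, pulled towards the lower value.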
Perplexity measures how well a language model predicts held-out text: it is the exponential of the average negative log-likelihood per token, so lower means the model is less surprised. It's the standard intrinsic metric for language modelling.
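To make "less surprised" concrete, the sketch below computes perplexity as the exponential of the average negative log-probability per token. It assumes you already have the probability the model assigned to each held-out token; the function name `perplexity` and the example probabilities are illustrative only.

```python
import math


def perplexity(token_probs: list[float]) -> float:
    """Perplexity from the probabilities a model assigned to each actual token.

    Assumes every probability is in (0, 1]; a zero probability would make
    the log undefined.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood per token (natural log).
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    # Exponentiate to get back to an "effective number of choices" scale.
    return math.exp(avg_nll)


# A model that spreads probability evenly over 4 options at every step.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
# A model that is fairly sure of each token (hypothetical values).
confident = perplexity([0.9, 0.8, 0.95])
print(uniform, confident)
```

A model that guesses uniformly among four options has a perplexity of about 4, which is why perplexity is often read as the effective number of choices the model is weighing at each step.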
For generated text, compare against a human reference by counting shared n-grams. BLEU (precision-oriented, for translation) and ROUGE (recall-oriented, for summarisation) are the classics.
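The precision/recall distinction between the two can be sketched with plain n-gram counting. This is a simplified illustration, not full BLEU or ROUGE: real BLEU combines several n-gram orders and applies a brevity penalty, and ROUGE has several variants. The helper names `ngrams` and `ngram_overlap`, the whitespace tokenisation and the example sentences are all assumptions made for the example.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap(candidate: str, reference: str, n: int = 1) -> tuple[float, float]:
    """Return (precision, recall) of shared n-grams.

    Precision (BLEU-style, clipped): shared n-grams / n-grams in the candidate.
    Recall (ROUGE-N-style): shared n-grams / n-grams in the reference.
    """
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    # Counter intersection keeps the minimum count of each n-gram, so a
    # repeated word cannot be credited more often than it appears in the reference.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall


# A short candidate that is entirely correct but leaves out half the reference.
prec, rec = ngram_overlap("the cat sat", "the cat sat on the mat", n=1)
print(f"precision={prec:.2f} recall={rec:.2f}")
```

Here every word of the candidate appears in the reference, so precision is perfect, but only half of the reference is covered, so recall is low. That gap is the reason a summary is usually judged on recall-oriented scores while a translation is judged on precision-oriented ones.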
LLM-as-judge uses a strong model to score or rank outputs against a rubric. It is flexible and often correlates well with human judgement, but it brings its own biases to watch for, such as favouring longer answers or whichever option is shown first.
Use precision/recall/F1 when outputs are labels and you care about the balance between false alarms and misses. Use perplexity to compare language models intrinsically. Use BLEU/ROUGE for generation when you have references — BLEU when precision matters (translation), ROUGE when coverage matters (summarisation). Reach for LLM-as-judge when quality is open-ended and no single reference captures it.
Head to Metrics to build each score step by step, or Compare to score one generated sentence across BLEU and ROUGE at once.