Evaluation Factory

Precision, Recall & F1 · classification

Key idea: from the four confusion-matrix counts, precision measures how many of your positive predictions were correct, and recall how many of the real positives you caught. F1 is their harmonic mean.

From counts to scores

precision = TP / (TP + FP)        # of what I flagged, how much was right
recall    = TP / (TP + FN)        # of what existed, how much I found
F1        = 2 · P · R / (P + R)   # harmonic mean of precision P and recall R; punishes imbalance
accuracy  = (TP + TN) / total     # can mislead on skewed classes
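The four formulas above can be sketched in a few lines of Python. The function name `classification_scores` and the choice to return 0.0 when a denominator is empty are assumptions made for this illustration, not something the formulas themselves dictate.

```python
def classification_scores(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Compute precision, recall, F1 and accuracy from confusion-matrix counts."""
    # Returning 0.0 for an empty denominator is a convention assumed here.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Made-up counts: 40 true positives, 10 false alarms, 20 misses, 30 true negatives.
scores = classification_scores(tp=40, fp=10, fn=20, tn=30)
print(scores)
```

With these counts precision is 0.8 and recall about 0.67, so F1 lands between them but closer to the lower value, which is what the harmonic mean is meant to do.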

Try it: edit the confusion matrix

Interactive · Precision / Recall / F1 (inputs: True Pos, False Pos, False Neg, True Neg)

Precision ↑: fewer false alarms

Recall ↑: fewer misses

Watch out: accuracy misleads on imbalanced classes
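A small worked case shows why accuracy can mislead on skewed classes. The dataset below is invented for illustration: 10 positives among 1000 examples, scored against a classifier that always predicts "negative".

```python
# Hypothetical skewed dataset: 10 positives, 990 negatives.
# A classifier that never flags anything produces these counts.
tp, fp, fn, tn = 0, 0, 10, 990

accuracy = (tp + tn) / (tp + fp + fn + tn)
recall = tp / (tp + fn)
# Precision is 0/0 when nothing is flagged; 0.0 is a convention assumed here.
precision = tp / (tp + fp) if (tp + fp) else 0.0

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")
# prints: accuracy=0.99 recall=0.00 precision=0.00
```

The classifier looks 99% accurate while finding none of the positives, which is why precision and recall are reported alongside accuracy.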

Perplexity · probabilistic

Key idea: how surprised is the model by real text? Average the per-word surprisal (negative log-probability) and exponentiate. Perplexity ≈ the effective number of choices the model faces at each step.

Perplexity from per-word probabilities

H  = −(1/N) · Σ log₂ P(wᵢ | context)   # cross-entropy, in bits
PP = 2^H                               # lower is better

# a perplexity of k ≈ choosing uniformly among k words each step
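As a minimal sketch of the two lines above, the function below takes the probabilities a model assigned to each correct word and returns the perplexity. The name `perplexity` and the input validation are assumptions added for the example.

```python
import math


def perplexity(probs: list[float]) -> float:
    """Perplexity from the probabilities a model gave each correct word."""
    if not probs or any(p <= 0.0 or p > 1.0 for p in probs):
        raise ValueError("probabilities must lie in (0, 1]")
    # Cross-entropy in bits: the mean per-word surprisal, -log2 p.
    h = -sum(math.log2(p) for p in probs) / len(probs)
    return 2.0 ** h


# Uniform guessing among 4 words at every step gives a perplexity of 4.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
# Surprisals of 1, 2, 3 and 3 bits average to 2.25 bits.
mixed = perplexity([0.5, 0.25, 0.125, 0.125])
print(uniform, mixed)
```

The uniform case illustrates the "effective number of choices" reading: a model that spreads its probability evenly over k words scores exactly k.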

Try it: per-word probabilities → perplexity

Enter the probability the model gave each correct word. Taller bars = more surprise.

Interactive · Perplexity (input: Probabilities; output: Per-word surprisal, −log₂ p)

Range: 1 (perfect) → ∞

Needs: model probabilities for real text; no reference output to compare against

Used for: comparing language models

BLEU · reference overlap

Key idea: reward a candidate for sharing n-grams with a reference (clipped so it can't game repetition), then penalise it for being too short. Precision-oriented — built for machine translation.

BLEU score

for n in 1..4:
    pₙ = clipped_matches(cand, ref, n) / total_ngrams(cand, n)

BP   = 1 if c > r else exp(1 − r/c)   # brevity penalty (c = candidate length, r = reference length)
BLEU = BP · exp( (1/4) · Σ log pₙ )      # geometric mean of precisions
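The pseudocode above can be turned into a short runnable sketch. It assumes a single reference, whitespace tokenisation, sentence-level scoring and no smoothing, so any n-gram order with zero matches sends the score to 0. The helper names `ngrams` and `bleu` are illustrative; production toolkits make additional choices that are not modelled here.

```python
import math
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # unsmoothed: one zero precision zeroes the geometric mean
        log_precisions.append(math.log(clipped / total))
    c, r = len(cand), len(ref)
    bp = 1.0 if c > r else math.exp(1 - r / c)  # brevity penalty
    return bp * math.exp(sum(log_precisions) / max_n)


score = bleu("the cat sat on the mat", "the cat sat on the red mat")
repeated = bleu("the the the the", "the cat sat on the mat")
print(round(score, 4), repeated)
```

In the first call the n-gram precisions are 1, 4/5, 3/4 and 2/3, and the candidate is one word shorter than the reference, so the brevity penalty applies. The second call shows clipping at work: repeating a matching word earns at most the two occurrences the reference contains, and the missing bigrams drop the score to 0.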

Try it: candidate vs reference

Interactive · BLEU (inputs: Candidate, Reference; output: N-gram precisions & score)

Leans on: precision

Best for: translation

Blind to: meaning, paraphrase

ROUGE · reference overlap

Key idea: the recall-oriented mirror of BLEU — how much of the reference shows up in the candidate. ROUGE-N counts n-grams; ROUGE-L uses the longest common subsequence. Built for summarisation.

ROUGE-N and ROUGE-L

ROUGE-N recall = overlap_ngrams(cand, ref, n) / total_ngrams(ref, n)
ROUGE-N prec.  = overlap_ngrams(cand, ref, n) / total_ngrams(cand, n)

LCS = len(longest_common_subsequence(cand, ref))   # in order, not contiguous
ROUGE-L recall = LCS / len(ref),  prec = LCS / len(cand)
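A minimal sketch of both variants follows, assuming whitespace tokenisation and a single reference. The names `rouge_n`, `lcs_length` and `rouge_l` are illustrative, and each scoring function returns a (recall, precision) pair to mirror the formulas above.

```python
from collections import Counter


def rouge_n(candidate: str, reference: str, n: int = 1) -> tuple[float, float]:
    """Return (recall, precision) for ROUGE-N."""
    cand, ref = candidate.split(), reference.split()
    cand_counts = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    # Counter intersection keeps the minimum count per n-gram (clipped overlap).
    overlap = sum((cand_counts & ref_counts).values())
    ref_total, cand_total = sum(ref_counts.values()), sum(cand_counts.values())
    recall = overlap / ref_total if ref_total else 0.0
    precision = overlap / cand_total if cand_total else 0.0
    return recall, precision


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str) -> tuple[float, float]:
    """Return (recall, precision) for ROUGE-L."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    return (lcs / len(ref) if ref else 0.0, lcs / len(cand) if cand else 0.0)


r1 = rouge_n("the cat sat", "the cat sat on the mat", n=1)
rl = rouge_l("the cat sat", "the cat sat on the mat")
print(r1, rl)
```

The short candidate gets perfect precision but only 0.5 recall in both variants, since it covers half of the reference. That gap is the recall orientation in action.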

Try it: candidate vs reference

Interactive · ROUGE (inputs: Candidate, Reference; output: Scores)

Leans on: recall

Best for: summarisation

ROUGE-L: rewards in-order overlap

LLM-as-judge · model-graded

Key idea: when no reference captures "good", ask a strong model to score the output against a rubric. This is flexible and often correlates well with human ratings, but the judge inherits biases you must control for.

A scoring-rubric judge

prompt = rubric + criteria + the_output_to_grade
scores = LLM(prompt)            # e.g. {relevance:4, accuracy:5, ...} on 1–5
overall = Σ wᵢ · scoreᵢ          # weighted aggregate

# pairwise variant: ask "is A or B better?" — average over both orderings
# to cancel position bias.
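The pairwise variant can be sketched without calling any model. Everything below is hypothetical: the `Judge` interface (a callable that returns the probability that the first answer shown is better) and the toy `biased_stub_judge`, which stands in for an LLM and deliberately favours whichever answer appears first.

```python
from typing import Callable

# Hypothetical judge interface: given (first, second) answers, return the
# probability that the FIRST answer is better. A real judge would wrap a model call.
Judge = Callable[[str, str], float]


def debiased_preference(judge: Judge, a: str, b: str) -> float:
    """Probability that `a` beats `b`, averaged over both presentation orders."""
    a_when_first = judge(a, b)
    a_when_second = 1.0 - judge(b, a)
    return (a_when_first + a_when_second) / 2.0


def biased_stub_judge(first: str, second: str) -> float:
    """Toy stand-in for an LLM: mildly prefers the longer answer, then adds a
    fixed bonus to whichever answer is shown first (position bias)."""
    if len(first) > len(second):
        quality = 0.6
    elif len(first) < len(second):
        quality = 0.4
    else:
        quality = 0.5
    return min(1.0, quality + 0.2)


raw = biased_stub_judge("a longer answer", "short")  # one ordering only
fair = debiased_preference(biased_stub_judge, "a longer answer", "short")
print(raw, fair)
```

In this toy the single-ordering score is inflated to about 0.8, while averaging over both orderings recovers the underlying 0.6 preference. With a real judge the bias is unlikely to be a clean additive constant, so swapping orders should be expected to reduce position bias rather than remove it exactly.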

Try it: aggregate a rubric

Drag the sliders as if you were the judge scoring one answer. Weights reflect a typical rubric.

Interactive · Rubric aggregation (output: per-criterion score × weight)

Relevance ×0.30
Accuracy ×0.35
Completeness ×0.20
Fluency ×0.15

Strength: open-ended quality, no reference

Biases: position, verbosity, self-preference

Mitigate: swap orderings, clear rubric, calibrate

Note: this widget just aggregates a rubric you fill in — it doesn't call a model. It illustrates how a judge's criterion scores combine into one number, and why the weights and rubric design matter so much.
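The aggregation itself is a weighted sum, sketched below with the four weights from the rubric above. The criterion scores are made-up slider positions on the 1–5 scale, and the name `aggregate` is illustrative.

```python
# Weights from the rubric above; the scores below are invented for the example.
WEIGHTS = {"relevance": 0.30, "accuracy": 0.35, "completeness": 0.20, "fluency": 0.15}


def aggregate(scores: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted average of per-criterion scores."""
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same criteria")
    # Dividing by the weight total keeps the result on the 1-5 scale
    # even if the weights do not sum exactly to 1.
    total_weight = sum(weights.values())
    return sum(weights[c] * scores[c] for c in weights) / total_weight


example = {"relevance": 4, "accuracy": 5, "completeness": 3, "fluency": 4}
overall = aggregate(example)

# Same scores, equal weights: the overall number changes with the rubric design.
equal = aggregate(example, {c: 0.25 for c in WEIGHTS})
print(round(overall, 2), round(equal, 2))
```

The same four criterion scores give 4.15 under the rubric weights and 4.0 under equal weights, a small illustration of how much the final number depends on how the rubric is weighted.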

Metrics, Step by Step
