Key idea: from the four confusion-matrix counts (TP, FP, FN, TN), precision measures what fraction of your
positive predictions were correct, and recall what fraction of the real positives you caught. F1 is their
harmonic mean.
From counts to scores
precision = TP / (TP + FP) # of what I flagged, how much was right
recall = TP / (TP + FN) # of what existed, how much I found
F1 = 2 · P · R / (P + R) # harmonic mean — punishes imbalance
accuracy = (TP + TN) / total # can mislead on skewed classes
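The four formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a library API: the function name `classification_scores` is hypothetical, the counts in the example are made up, and returning 0.0 for an undefined ratio (zero denominator) is an assumption rather than a universal convention.

```python
def classification_scores(tp, fp, fn, tn):
    """Return precision, recall, F1 and accuracy from confusion-matrix counts."""
    # Assumption: a ratio with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Made-up skewed example: only 10 real positives among 1000 items.
scores = classification_scores(tp=2, fp=1, fn=8, tn=989)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

With these counts the classifier misses 8 of the 10 positives, so recall is 0.2, yet accuracy comes out above 0.99 because the true negatives dominate the total.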
Try it — edit the confusion matrix
Interactive · Precision / Recall / F1
Precision ↑: Fewer false alarms
Recall ↑: Fewer misses
Watch out: Accuracy lies on imbalance
Perplexity · probabilistic
Key idea: how surprised is the model by real text? Average the per-word
surprisal (negative log-probability) and exponentiate. Perplexity ≈ the effective number of choices
the model faces at each step.
Perplexity from per-word probabilities
H = −(1/N) · Σ log₂ P(wᵢ | context) # cross-entropy, in bits
PP = 2^H # lower is better
# a perplexity of k ≈ choosing uniformly among k words each step
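As a rough sketch of the two formulas above, the following Python computes perplexity from a list of per-word probabilities. The function name `perplexity` and the example probabilities are illustrative assumptions; a real evaluation would take the probabilities from a model scored on held-out text.

```python
import math


def perplexity(probs):
    """Perplexity from the probabilities a model gave each correct word."""
    if not probs or any(p <= 0 or p > 1 for p in probs):
        raise ValueError("probabilities must be non-empty and lie in (0, 1]")
    surprisals = [-math.log2(p) for p in probs]        # per-word surprisal, in bits
    cross_entropy = sum(surprisals) / len(surprisals)  # H: average surprisal
    return 2 ** cross_entropy                          # PP = 2^H


# Made-up inputs: a model that always assigns 0.25 to the correct word
# behaves like a uniform choice among 4 words.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
mixed = perplexity([0.9, 0.5, 0.05, 0.7])
print(uniform, mixed)
```

A single badly predicted word (0.05 in the second list) raises the average surprisal noticeably, which is why perplexity is sensitive to rare but confident mistakes.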
Try it — per-word probabilities → perplexity
Enter the probability the model gave each correct word. Taller bars = more surprise.
Interactive · Perplexity
Range: 1 (perfect) → ∞
Needs: Model probabilities on real text; no reference output to compare against
Used for: Comparing language models
BLEU · reference overlap
Key idea: reward a candidate for sharing n-grams with a reference (clipped so it can't game
repetition), then penalise it for being too short. Precision-oriented — built for machine translation.
BLEU score
for n in 1..4:
    pₙ = clipped_matches(cand, ref, n) / total_ngrams(cand, n)
BP = 1 if c > r else exp(1 − r/c) # brevity penalty (c = candidate length, r = reference length)
BLEU = BP · exp( (1/4) · Σ log pₙ ) # geometric mean of precisions
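The pseudocode above can be turned into a small runnable sketch. The assumptions here are mine, not the source's: whitespace tokenisation, a single reference, and no smoothing, so any n-gram order with zero matches drives the score to 0. The helper names `ngrams` and `bleu` are illustrative; production toolkits differ in tokenisation and smoothing.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bleu(candidate, reference, max_n=4):
    """Unsmoothed single-reference BLEU on whitespace tokens (a sketch)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        total = sum(cand_counts.values())
        # Clip each n-gram's count to how often it appears in the reference.
        clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # assumption: no smoothing, so a zero precision zeroes BLEU
        log_precisions.append(math.log(clipped / total))
    c, r = len(cand), len(ref)
    bp = 1.0 if c > r else math.exp(1 - r / c)  # brevity penalty
    return bp * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat"
print(bleu("the cat sat on the mat", reference))  # exact match
print(bleu("the cat sat on", reference))          # precise but too short
print(bleu("the the the the", reference))         # repetition gains nothing
```

The short candidate has perfect n-gram precision, so its whole loss comes from the brevity penalty, while the repeated-word candidate shows clipping at work: it has no matching bigrams and scores 0.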
Try it — candidate vs reference
Interactive · BLEU
Leans on: Precision
Best for: Translation
Blind to: Meaning, paraphrase
ROUGE · reference overlap
Key idea: the recall-oriented mirror of BLEU — how much of the reference shows up in
the candidate. ROUGE-N counts n-grams; ROUGE-L uses the longest common subsequence. Built for summarisation.
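A minimal sketch of the two variants named above, assuming whitespace tokenisation and a single reference. It reports recall only, whereas ROUGE toolkits commonly report precision and an F-measure as well. The function names are illustrative, not taken from any library.

```python
from collections import Counter


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of the reference's n-grams that also appear in the candidate."""
    cand, ref = candidate.split(), reference.split()
    cand_counts = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    return overlap / total


def lcs_length(a, b):
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(candidate, reference):
    """LCS length divided by reference length (recall form of ROUGE-L)."""
    cand, ref = candidate.split(), reference.split()
    return lcs_length(cand, ref) / len(ref) if ref else 0.0


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge_n_recall(candidate, reference, n=1))
print(rouge_n_recall(candidate, reference, n=2))
print(rouge_l_recall(candidate, reference))
```

Note the denominator: every score divides by a count taken from the reference, not the candidate, which is what makes these measures recall-oriented.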
LLM-as-a-judge
Key idea: when no reference captures "good", ask a strong model to score the output against a
rubric. Flexible, and often correlates well with human ratings, but it inherits biases you must control for.
A scoring-rubric judge
prompt = rubric + criteria + the_output_to_grade
scores = LLM(prompt) # e.g. {relevance:4, accuracy:5, ...} on 1–5
overall = Σ wᵢ · scoreᵢ # weighted aggregate
# pairwise variant: ask "is A or B better?" and average over both orderings
# to cancel position bias.
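The aggregation and the order-swapping trick can be sketched without calling any model. Everything named here is hypothetical: the criteria, the weights, and the `judge` callable, which stands in for an LLM call and is assumed to return the probability that the answer shown first is better. The sketch also divides by the weight total, so it matches the weighted sum above whenever the weights add up to 1.

```python
def aggregate(scores, weights):
    """Weighted average of per-criterion scores (weights normalised to sum to 1)."""
    total_weight = sum(weights[criterion] for criterion in scores)
    return sum(weights[c] * scores[c] for c in scores) / total_weight


def pairwise_preference(judge, answer_a, answer_b):
    """Preference for A, averaged over both presentation orders.

    `judge(first, second)` is a hypothetical stand-in for a model call that
    returns the probability (0..1) that the answer shown first is better.
    """
    a_shown_first = judge(answer_a, answer_b)
    a_shown_second = 1.0 - judge(answer_b, answer_a)
    return (a_shown_first + a_shown_second) / 2


# Made-up rubric scores on a 1-5 scale, with made-up weights.
scores = {"relevance": 4, "accuracy": 5, "clarity": 3}
weights = {"relevance": 0.4, "accuracy": 0.4, "clarity": 0.2}
overall = aggregate(scores, weights)


# A deliberately biased stub judge: it always favours whichever answer comes first.
def position_biased_judge(first, second):
    return 0.7


balanced = pairwise_preference(position_biased_judge, "answer A", "answer B")
print(overall, balanced)
```

The stub judge has no real opinion about the answers, only a preference for the first slot, and averaging over both orderings brings its verdict back to 0.5, a tie.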
Try it — aggregate a rubric
Drag the sliders as if you were the judge scoring one answer. Weights reflect a typical rubric.
Interactive · Rubric aggregation
Strength: Open-ended quality, no reference
Biases: Position, verbosity, self-preference
Mitigate: Swap orderings, clear rubric, calibrate
Note: this widget just aggregates a rubric you fill in — it doesn't call a model. It illustrates how a judge's
criterion scores combine into one number, and why the weights and rubric design matter so much.