A language model assigns probabilities to what comes next. Type a few words; pick a model order.
Trained on a tiny built-in corpus of cat/dog sentences — try words like the, cat,
dog, sat, ran, happy.
A language model (LM) is a system that assigns probabilities to sequences of words. Given some context, it answers one deceptively simple question: what is likely to come next? Everything from autocomplete to chatbots to translation is built on top of this single capability.
Formally, a language model factors the probability of a sequence with the chain rule, as a product of the probability of each word given everything before it:
P(w₁, w₂, …, wₙ) = P(w₁) · P(w₂ | w₁) · P(w₃ | w₁, w₂) · … · P(wₙ | w₁ … wₙ₋₁)
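As a minimal sketch of the chain rule above, the Python below multiplies per-word conditional probabilities. The names `sequence_probability` and `uniform_model` are assumptions made for illustration, and the uniform model is only a stand-in, not one of the models on this site.

```python
from typing import Callable, Sequence

def sequence_probability(
    words: Sequence[str],
    cond_prob: Callable[[str, Sequence[str]], float],
) -> float:
    """Chain rule: multiply P(w_i | w_1 .. w_{i-1}) over the sequence."""
    prob = 1.0
    for i, word in enumerate(words):
        # words[:i] is the full history before position i.
        prob *= cond_prob(word, words[:i])
    return prob

# Hypothetical stand-in model: ignores the history and spreads
# probability uniformly over an assumed four-word vocabulary.
def uniform_model(word: str, history: Sequence[str]) -> float:
    return 0.25

p = sequence_probability(["the", "cat", "sat"], uniform_model)
print(p)  # 0.25 * 0.25 * 0.25
```

In practice, implementations usually sum log-probabilities instead of multiplying raw probabilities, since the product underflows quickly on long sequences.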
The whole history of language modelling is really a history of how much context a model can use and how it represents that context — from ignoring word order entirely, to looking at the last few words, to learned vectors, to neural memory, to attention over the entire sequence.
This site walks through five milestones, each fixing a limitation of the one before. Open the Models page to step through each algorithm interactively.
Bag of words represents text as an unordered count vector over a vocabulary. It is simple and fast, but it throws away word order completely: "dog bites man" and "man bites dog" look identical.
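A short sketch of the count-vector idea, assuming lowercase, whitespace-separated tokens (the helper name `bag_of_words` is hypothetical). It shows the two example sentences collapsing to the same representation.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Count each lowercased, whitespace-separated token; order is discarded."""
    return Counter(text.lower().split())

a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")

# Both sentences contain the same words the same number of times,
# so their count vectors are indistinguishable.
same = (a == b)
print(same)
```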
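N-gram models bring back a little order by conditioning on the previous n−1 words. Probabilities are estimated by counting: P(next | context) = count(context, next) / count(context). The context is limited to n−1 words, and the counts become sparse as n grows, because most long contexts never appear in the training text.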
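The counting estimate can be sketched in a few lines. This is a simplified illustration, not the site's implementation: the three-sentence corpus is made up here (it is not the built-in corpus), the `<s>` start marker and function names are assumptions, and there is no smoothing, so unseen continuations get probability zero.

```python
from collections import Counter, defaultdict

def train_ngram(sentences, n=2):
    """Count how often each word follows each (n-1)-word context."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Pad with start markers so the first real word has a context.
        tokens = ["<s>"] * (n - 1) + sentence.lower().split()
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - n + 1:i])
            counts[context][tokens[i]] += 1
    return counts

def next_word_prob(counts, context, word):
    """P(word | context) = count(context, word) / count(context)."""
    context = tuple(context)
    total = sum(counts[context].values())
    return counts[context][word] / total if total else 0.0

# Made-up toy corpus, for illustration only.
corpus = ["the cat sat", "the dog ran", "the cat ran"]
bigram = train_ngram(corpus, n=2)

p_cat = next_word_prob(bigram, ["the"], "cat")      # "cat" follows "the" in 2 of 3 cases
p_happy = next_word_prob(bigram, ["the"], "happy")  # never seen after "the"
print(p_cat, p_happy)
```

The zero for an unseen continuation is the sparsity problem in miniature; practical n-gram models add smoothing or back-off to avoid it.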
Word embeddings learn a dense vector for every word so that words used in similar contexts land near each other. This captures similarity in meaning and some analogies, but each word still has a single, context-free vector.
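To make "near each other" concrete, the sketch below compares vectors with cosine similarity, a common choice for this. The three-dimensional vectors are hand-written toy values, not learned embeddings, and the lookup table `vectors` is hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-written toy vectors (NOT learned): "cat" and "dog" are placed
# close together on purpose, "ran" is placed elsewhere.
vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "ran": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_ran = cosine(vectors["cat"], vectors["ran"])
print(sim_cat_dog > sim_cat_ran)
```

The dictionary lookup also shows the limitation: a word maps to the same vector no matter which sentence it appears in.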
An LSTM is a recurrent network that reads one word at a time and carries a memory cell forward, using gates to decide what to keep, what to write and what to output. It handles variable-length context, but it processes words strictly in order, which makes it slow to train and prone to forgetting over long ranges.
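The keep/write/output gating described here matches an LSTM-style cell. Assuming that, one update step can be sketched with scalars in place of vectors. The `params` layout and the all-zero weights are hypothetical and chosen only to make the arithmetic easy to follow; a trained network would learn these values.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell.

    `params` maps a gate name to (weight_on_x, weight_on_h, bias).
    """
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much old memory to keep
    i = sigmoid(pre("input"))         # how much new content to write
    o = sigmoid(pre("output"))        # how much memory to expose
    g = math.tanh(pre("candidate"))   # proposed new content

    c = f * c_prev + i * g            # updated memory cell
    h = o * math.tanh(c)              # updated hidden state
    return h, c

# Hypothetical untrained weights: every gate sits at 0.5, candidate at 0.
zero = (0.0, 0.0, 0.0)
params = {"forget": zero, "input": zero, "output": zero, "candidate": zero}

h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=1.0, params=params)
print(h, c)  # half of the old memory is kept, nothing new is written
```

Because each step needs the previous step's `h` and `c`, the steps cannot run in parallel across the sequence, which is the ordering bottleneck mentioned above.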
The Transformer drops recurrence entirely. Self-attention lets every token look directly at every other token in parallel and weight the ones that matter most. This is the architecture behind modern large language models.
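A stripped-down sketch of self-attention in plain Python follows. It is deliberately simplified: queries, keys and values are all just the input vectors, whereas real Transformers apply learned projections, use several attention heads and add positional information. The function names and the toy input are assumptions for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Single-head scaled dot-product self-attention with no learned projections."""
    d = len(vectors[0])
    outputs, all_weights = [], []
    for q in vectors:  # each token acts as a query
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        # Output is the weighted average of all token vectors.
        out = [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(d)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy input: three 2-dimensional token vectors (made-up values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
print(weights[0])  # how strongly token 0 attends to tokens 0, 1 and 2
```

No token's output depends on another token's output, so in a real implementation the loop over queries becomes a single matrix multiplication.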
The standard yardstick is perplexity: roughly, how "surprised" the model is by real text. If p is the
geometric mean of the probabilities the model assigns to the held-out words, perplexity is 1/p. Lower
is better, and a perplexity of k means the model is as uncertain as if it were choosing uniformly among
k words at each step.
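A small sketch of the calculation, reading the average as a geometric mean (equivalently, the exponential of the average negative log-probability). The function name `perplexity` and the probability lists are made up for illustration.

```python
import math

def perplexity(word_probs):
    """exp(average negative log-probability) = 1 / geometric mean of the probabilities."""
    avg_neg_log = -sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(avg_neg_log)

# A model that gives every held-out word probability 1/4 behaves like
# a uniform choice among 4 words.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])

# Geometric mean of 0.5 and 0.125 is 0.25, so this also comes out near 4,
# even though the arithmetic mean (0.3125) would suggest 3.2.
mixed = perplexity([0.5, 0.125])

print(uniform, mixed)  # both approximately 4, up to floating-point rounding
```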
Each model on the Models page is meant to use context more effectively than the one before it, which generally shows up as lower perplexity. Head to Models to build each one step by step, or Compare to see how n-gram orders stack up on the same input.