Language Models, Step by Step

Five models, each with the idea, the pseudocode, and an interactive widget you can drive yourself. They build on each other — start at the top and watch what each one adds.

Bag-of-Words counts

Key idea: represent a document as a vector of word counts over a fixed vocabulary. Order is discarded — only "which words, how often" survives.
Build a bag-of-words vector
function bow_vectorise(docs):
    vocab = sorted(unique_words(docs))   # the columns
    vectors = []
    for doc in docs:
        v = [0] * len(vocab)
        for word in tokenise(doc):
            v[ vocab.index(word) ] += 1     # count occurrences
        vectors.append(v)
    return vocab, vectors

# similarity between two docs = cosine of their count vectors
sim(a, b) = dot(a, b) / ( ||a|| * ||b|| )
Try it — vectorise & compare two documents

Edit either document. The shared vocabulary forms the columns; each doc becomes a row of counts. Cosine similarity measures overlap — notice it ignores order entirely.
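To make the "ignores order" point concrete, here is a minimal Python sketch of the pseudocode above. The whitespace tokeniser and the two example sentences are assumptions made for illustration; a real pipeline would handle punctuation and casing more carefully.

```python
import math
from collections import Counter


def tokenise(doc):
    # Assumption: lowercase and split on whitespace; real tokenisers do more.
    return doc.lower().split()


def bow_vectorise(docs):
    # Shared, sorted vocabulary: one column per distinct word.
    vocab = sorted({word for doc in docs for word in tokenise(doc)})
    vectors = []
    for doc in docs:
        counts = Counter(tokenise(doc))
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for an empty document
    return dot / (norm_a * norm_b)


# Same words, opposite meaning: the count vectors are identical.
vocab, (vec_a, vec_b) = bow_vectorise(["the dog bit the man", "the man bit the dog"])
print(vocab)         # ['bit', 'dog', 'man', 'the']
print(vec_a, vec_b)  # [1, 1, 1, 2] [1, 1, 1, 2]
print(cosine(vec_a, vec_b))
```

The two sentences mean different things, yet their vectors match exactly, so the cosine comes out at 1 up to floating-point rounding.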

Captures: Word presence & frequency
Ignores: Order, grammar, meaning
Used for: Classification, search, TF-IDF

N-grams counts + order

Key idea: approximate "the next word depends on everything before it" with "the next word depends only on the previous n−1 words." Estimate the probabilities by counting.
Train an n-gram model by counting
function train_ngram(corpus, n):
    counts = {}                          # (context) -> {next_word: count}
    for sentence in corpus:
        toks = tokenise(sentence)
        for i in 0 .. len(toks) - n:
            context = toks[i : i+n-1]
            next    = toks[i+n-1]
            counts[context][next] += 1
    return counts

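# prediction: relative frequency of next after this context
# (undefined if the context was never seen; back off to a shorter context or smooth)
P(next | context) = counts[context][next] / sum over w of counts[context][w]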
Try it — slide the window & count

Pick an order, then Step to slide the n-gram window across the sentence, accumulating counts. Those counts are the model.
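The sliding-window count can be sketched in a few lines of Python. This is an illustrative version, assuming a whitespace tokeniser and a tiny made-up corpus; it returns probability 0.0 for an unseen context and does not implement back-off or smoothing.

```python
from collections import Counter, defaultdict


def train_ngram(corpus, n):
    # counts maps a context tuple (n-1 words) to a Counter of next words.
    counts = defaultdict(Counter)
    for sentence in corpus:
        toks = sentence.lower().split()  # assumption: whitespace tokeniser
        for i in range(len(toks) - n + 1):
            context = tuple(toks[i:i + n - 1])  # tuples are hashable dict keys
            counts[context][toks[i + n - 1]] += 1
    return counts


def prob(counts, context, word):
    # Relative frequency; 0.0 for an unseen context (no back-off here).
    seen = counts.get(tuple(context))
    if not seen:
        return 0.0
    return seen[word] / sum(seen.values())


# Hypothetical two-sentence corpus.
corpus = ["the cat sat on the mat", "the cat ate the fish"]
bigram = train_ngram(corpus, 2)
print(dict(bigram[("the",)]))         # {'cat': 2, 'mat': 1, 'fish': 1}
print(prob(bigram, ["the"], "cat"))   # 0.5
```

With n = 2, "the" is followed by "cat" twice out of four occurrences, hence 0.5. Raising n makes each context more specific and the table sparser, which is the weakness the summary names.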

Context: Previous n−1 words
Weakness: Sparse & huge for large n
Tricks: Smoothing, back-off

Word2Vec embeddings

Key idea: "a word is known by the company it keeps." Train each word's vector to predict the words around it, and words used in similar contexts end up with similar vectors.
Skip-gram training data + objective
# 1. Slide a window; emit (center, context) pairs
for i in 0 .. len(sentence) - 1:
    for j in max(0, i - W) .. min(len(sentence) - 1, i + W), j != i:
        emit( center=word[i], context=word[j] )

# 2. Learn vectors so the center predicts its context words
maximise  P(context | center) = exp(v_context · v_center) / Σ_w exp(v_w · v_center)   # softmax over the vocabulary
# gradient ascent nudges co-occurring words' vectors together
Try it — generate skip-gram pairs

Step through each center word; the window emits a training pair with every neighbour.
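Pair generation is simple enough to write out directly. The sketch below is one way to do it in Python, clipping the window at the sentence edges; the function name `skipgram_pairs` and the example sentence are chosen here for illustration.

```python
def skipgram_pairs(tokens, window):
    # Emit (center, context) for every neighbour within `window` positions.
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                # clip at the start
        hi = min(len(tokens), i + window + 1)  # clip at the end (exclusive)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs("the quick brown fox".split(), window=1)
for center, context in pairs:
    print(center, "->", context)
```

With a window of 1, the edge words emit one pair each and the interior words emit two, giving six pairs for this four-word sentence. Training then consumes these pairs; no vectors are learned in this snippet.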

…and explore the learned space

Once trained, vectors cluster by meaning. Click a word to see its nearest neighbours by cosine similarity (these vectors are illustrative, pre-placed in 2D).
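A nearest-neighbour lookup by cosine similarity might look like the sketch below. The four 2D vectors are hypothetical and hand-placed purely for illustration, in the same spirit as the map; they are not the output of any trained model.

```python
import math

# Hypothetical, hand-placed 2D vectors (not trained).
vectors = {
    "king": (0.9, 0.8),
    "queen": (0.8, 0.9),
    "apple": (0.9, -0.5),
    "pear": (0.8, -0.6),
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def nearest(word, vectors, k=2):
    # Rank every other word by cosine similarity to `word`, highest first.
    target = vectors[word]
    scored = [(cosine(target, vec), other)
              for other, vec in vectors.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


print(nearest("king", vectors, k=1))   # ['queen']
print(nearest("apple", vectors, k=1))  # ['pear']
```

Cosine compares direction only, so two vectors of different lengths that point the same way still count as close.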

Output: One dense vector per word
Captures: Similarity, analogies
Limit: One vector regardless of context

LSTM sequence memory

Key idea: read the sequence one step at a time, carrying a memory cell. Three gates decide what to forget, what to write, and what to output — letting information survive across many steps.
LSTM cell update (per time step)
# x = current input, h = previous hidden, c = previous cell
f = sigmoid(W_f·x + U_f·h + b_f)     # forget gate  (0..1)
i = sigmoid(W_i·x + U_i·h + b_i)     # input gate   (0..1)
o = sigmoid(W_o·x + U_o·h + b_o)     # output gate  (0..1)
g = tanh   (W_g·x + U_g·h + b_g)     # candidate memory

c = f * c  +  i * g                  # keep some old, add some new
h = o * tanh(c)                      # exposed output
Try it — watch the gates & memory

Feed a sequence of numbers (toy 1-unit LSTM, fixed weights). Step to see each gate open/close and the memory cell c and hidden state h evolve.
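The cell update can be traced by hand with a scalar, 1-unit LSTM. The sketch below follows the equations above; the weight values in `params` are arbitrary numbers chosen for illustration and are not the widget's weights.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h, c, p):
    # One time step of a 1-unit LSTM; every quantity is a plain float.
    f = sigmoid(p["W_f"] * x + p["U_f"] * h + p["b_f"])    # forget gate
    i = sigmoid(p["W_i"] * x + p["U_i"] * h + p["b_i"])    # input gate
    o = sigmoid(p["W_o"] * x + p["U_o"] * h + p["b_o"])    # output gate
    g = math.tanh(p["W_g"] * x + p["U_g"] * h + p["b_g"])  # candidate memory
    c_new = f * c + i * g          # keep some old memory, add some new
    h_new = o * math.tanh(c_new)   # exposed output
    return h_new, c_new, {"f": f, "i": i, "o": o, "g": g}


# Hypothetical fixed weights, for illustration only.
params = {
    "W_f": 0.5, "U_f": 0.1, "b_f": 1.0,
    "W_i": 0.8, "U_i": 0.2, "b_i": 0.0,
    "W_o": 0.6, "U_o": 0.3, "b_o": 0.0,
    "W_g": 1.0, "U_g": 0.5, "b_g": 0.0,
}

h, c = 0.0, 0.0
for x in [1.0, 0.5, -1.0]:
    h, c, gates = lstm_step(x, h, c, params)
    print(f"x={x:+.1f}  f={gates['f']:.2f}  i={gates['i']:.2f}  "
          f"o={gates['o']:.2f}  c={c:+.3f}  h={h:+.3f}")
```

The positive forget-gate bias keeps f above 0.5 for small inputs, so the cell tends to retain most of its memory between steps. That is one common design choice, not a requirement of the architecture.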

Context: Whole sequence, via memory
Strength: Variable length, long-ish range
Limit: Sequential — hard to parallelise

Transformer attention

Key idea: let every token look at every other token directly and in parallel. Self-attention computes, for each token, a weighted blend of all tokens — the weights say "how much should I attend to you?"
Scaled dot-product self-attention
# each token gets Query, Key, Value vectors (learned projections)
Q = X · W_q,   K = X · W_k,   V = X · W_v

scores   = (Q · Kᵀ) / sqrt(d_k)      # d_k = key dimension; how well each query matches each key
weights  = softmax(scores, axis = keys)   # each row sums to 1
output   = weights · V               # blend of values

# done for all tokens at once → fully parallel, unlike an RNN
Try it — the attention matrix

Step reveals one query row at a time: its scaled dot-products softmaxed into attention weights over every key. Brighter cells = more attention.
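The attention matrix can be computed in plain Python for a handful of tokens. This sketch assumes the simplest possible setup, Q = K = V = X with no learned projections, and uses three made-up 2D token vectors; it shows the mechanism only.

```python
import math


def softmax(row):
    m = max(row)                         # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X):
    # Assumption: Q = K = V = X (no W_q, W_k, W_v projections).
    d = len(X[0])
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d)
               for k_row in X]
              for q_row in X]
    weights = [softmax(row) for row in scores]     # each row sums to 1
    output = [[sum(w * v[col] for w, v in zip(w_row, X)) for col in range(d)]
              for w_row in weights]                # weighted blend of values
    return weights, output


# Hypothetical 2D token vectors.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, output = self_attention(X)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is one query's distribution over the keys. The first token scores itself and the third token equally and the second token lowest, because its dot product with the second token is zero.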

Context: Every token, all at once
Strength: Parallel, long-range
Powers: GPT, BERT, T5, Llama …

Note: this demo uses fixed, deterministic embeddings with Q=K=V (no trained projections), so it shows the mechanism of attention, not learned behaviour. Real models learn W_q, W_k, W_v and stack many attention heads and layers.