LMs Factory | Models

Bag-of-Words · counts

Key idea: represent a document as a vector of word counts over a fixed vocabulary. Order is discarded — only "which words, how often" survives.

Build a bag-of-words vector

function bow_vectorise(docs):
    vocab = sorted(unique_words(docs))   # the columns
    vectors = []
    for doc in docs:
        v = [0] * len(vocab)
        for word in tokenise(doc):
            v[ vocab.index(word) ] += 1     # count occurrences
        vectors.append(v)
    return vocab, vectors

# similarity between two docs = cosine of their count vectors
sim(a, b) = dot(a, b) / ( ||a|| * ||b|| )
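The pseudocode above can be sketched in runnable Python roughly as follows. The regex tokeniser and the two example sentences are assumptions made for illustration; the page does not specify how `tokenise` works.

```python
import math
import re
from collections import Counter


def tokenise(text):
    """Lowercase, then keep runs of letters, digits, or apostrophes (an assumed, simplistic tokeniser)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bow_vectorise(docs):
    """Return (vocab, vectors): one count vector per document over a shared vocabulary."""
    tokenised = [tokenise(doc) for doc in docs]
    vocab = sorted({word for toks in tokenised for word in toks})  # the columns
    vectors = []
    for toks in tokenised:
        counts = Counter(toks)  # a Counter returns 0 for absent words
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


def cosine(a, b):
    """Cosine similarity; defined here as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Same words in a different order: the two count vectors are identical.
vocab, (vec_a, vec_b) = bow_vectorise(["the cat sat on the mat", "the mat sat on the cat"])
print(vocab)
print(vec_a, vec_b)
print(cosine(vec_a, vec_b))
```

Because the two example sentences contain the same words with the same counts, their vectors match and the similarity comes out at (approximately) 1, even though the sentences mean different things.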

Try it — vectorise & compare two documents

Edit either document. The shared vocabulary forms the columns; each doc becomes a row of counts. Cosine similarity measures overlap — notice it ignores order entirely.

Interactive · Bag-of-Words. Panels: Doc A; Doc B; Count vectors

Captures: Word presence & frequency

Ignores: Order, grammar, meaning

Used for: Classification, search, TF-IDF

N-grams · counts + order

Key idea: approximate "the next word depends on everything before it" with "the next word depends only on the previous n−1 words." Estimate the probabilities by counting.

Train an n-gram model by counting

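function train_ngram(corpus, n):
    counts = {}                          # context -> {next_word: count}; missing entries start at 0
    for sentence in corpus:
        toks = tokenise(sentence)
        for i in 0 .. len(toks) - n:     # inclusive; the last window ends on the final token
            context = tuple(toks[i : i+n-1])   # previous n-1 words, as a hashable key
            next    = toks[i+n-1]
            counts[context][next] += 1
    return counts

# prediction: relative frequency of next after context
# (if the context was never seen, back off to a shorter context)
P(next | context) = counts[context][next] / sum(counts[context].values())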
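A minimal Python sketch of the counting and a naive back-off, under stated assumptions: the whitespace tokeniser, the three-sentence corpus, and the choice to also store counts for every shorter context (so there is something to back off to) are illustrative, not taken from the page. The back-off here simply drops the oldest context word and is not a normalised scheme such as Katz back-off.

```python
from collections import Counter, defaultdict


def train_ngram(corpus, n):
    """Count next-word frequencies for every context of length 0 .. n-1."""
    counts = defaultdict(Counter)  # context tuple -> Counter of next words
    for sentence in corpus:
        toks = sentence.lower().split()  # assumed whitespace tokeniser
        for order in range(1, n + 1):
            for i in range(len(toks) - order + 1):
                context = tuple(toks[i : i + order - 1])
                counts[context][toks[i + order - 1]] += 1
    return counts


def prob(counts, context, word):
    """Relative-frequency estimate; drops the oldest context word until the context has been seen."""
    context = tuple(context)
    while context and context not in counts:
        context = context[1:]  # back off to a shorter context
    seen = counts.get(context, Counter())
    total = sum(seen.values())
    return seen[word] / total if total else 0.0


counts = train_ngram(["the cat sat", "the cat ran", "the dog sat"], n=2)
print(prob(counts, ["the"], "cat"))   # seen context: 2 of the 3 words after "the" are "cat"
print(prob(counts, ["bird"], "sat"))  # unseen context: falls back to the unigram frequency of "sat"
```

A seen word after a seen context still gets probability 0 if that exact pair never occurred, which is the gap that smoothing is meant to fill.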

Try it — slide the window & count

Pick an order, then Step to slide the n-gram window across the sentence, accumulating counts. Those counts are the model.

Interactive · N-gram counter. Panels: Sentence; Sentence & sliding window; Counts so far (top 10)

Context: Previous n−1 words

Weakness: Sparse & huge for large n

Tricks: Smoothing, back-off

Word2Vec · embeddings

Key idea: "a word is known by the company it keeps." Train each word's vector to predict the words around it, and words used in similar contexts end up with similar vectors.

Skip-gram training data + objective

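# 1. Slide a window of radius W; emit (center, context) pairs
for i in 0 .. len(sentence) - 1:
    for j in max(0, i - W) .. min(len(sentence) - 1, i + W), j != i:
        emit( center=word[i], context=word[j] )

# 2. Learn vectors so the center predicts its context words
#    v = input (center) vectors, u = output (context) vectors; the softmax runs over the vocabulary
maximise  P(context | center) = exp( u_context · v_center ) / Σ_w exp( u_w · v_center )
# gradient ascent raises u_context · v_center for co-occurring pairs, so words that share contexts get similar vectors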
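To make the objective concrete, here is a small sketch that evaluates the softmax for one center word over a three-word vocabulary. The vectors are invented 2-D numbers, not trained values, and the names `in_vecs`, `out_vecs`, and `context_probs` are hypothetical. Practical word2vec training usually replaces this full softmax with negative sampling or hierarchical softmax, because summing over the whole vocabulary is expensive.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_probs(center, in_vecs, out_vecs):
    """P(context | center) for every vocabulary word, from dot products with the center vector."""
    v = in_vecs[center]
    words = sorted(out_vecs)
    scores = [sum(a * b for a, b in zip(out_vecs[w], v)) for w in words]
    return dict(zip(words, softmax(scores)))


# Made-up 2-D vectors, purely for illustration (not trained).
in_vecs = {"cat": [1.0, 0.2], "dog": [0.9, 0.3], "car": [-0.8, 1.0]}
out_vecs = {"cat": [0.8, 0.1], "dog": [1.0, 0.1], "car": [-1.0, 0.9]}

probs = context_probs("cat", in_vecs, out_vecs)
loss = -math.log(probs["dog"])  # negative log-likelihood of one (cat -> dog) training pair
print(probs)
print(loss)
```

Training would adjust the vectors to lower this loss summed over all emitted pairs, which is the same as maximising the probability in step 2.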

Try it — generate skip-gram pairs

Step through each center word; the window emits a training pair with every neighbour.
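The pair generation can be sketched as a short Python function. The example sentence and the window radius of 1 are assumptions chosen to keep the output small; the window is clipped at the sentence boundaries.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for each token and every neighbour within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                 # clip at the start of the sentence
        hi = min(len(tokens), i + window + 1)   # clip at the end of the sentence
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs("the quick brown fox".split(), window=1)
for center, context in pairs:
    print(center, "->", context)
```

Edge words have fewer neighbours, so they contribute fewer pairs than words in the middle of the sentence.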

Interactive · Skip-gram pairs. Panels: Sentence; Sentence (center + context); (center → context) pairs

…and explore the learned space

Once trained, vectors cluster by meaning. Click a word to see its nearest neighbours by cosine similarity (these vectors are illustrative, pre-placed in 2D).
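A nearest-neighbour lookup of this kind might look like the sketch below. The four 2-D points are hand-placed for the example, in the same spirit as the illustrative map, and are not the page's actual coordinates or trained embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def nearest(word, vectors, k=3):
    """Return the k other words whose vectors have the highest cosine similarity to `word`."""
    target = vectors[word]
    scored = [(cosine(target, vec), other) for other, vec in vectors.items() if other != word]
    scored.sort(reverse=True)  # highest similarity first
    return [other for _, other in scored[:k]]


# Hand-placed 2-D points, invented for this example (not trained embeddings).
vectors = {
    "king": (0.9, 0.8),
    "queen": (0.8, 0.9),
    "apple": (-0.7, 0.6),
    "pear": (-0.8, 0.5),
}

print(nearest("king", vectors, k=1))
print(nearest("apple", vectors, k=1))
```

Cosine similarity compares direction only, so two points far apart in length but pointing the same way still count as close neighbours.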

Interactive · Embedding map. Panels: 2D embedding space; Nearest neighbours (cosine)

Output: One dense vector per word

Captures: Similarity, analogies

Limit: One vector regardless of context

LSTM · sequence memory

Key idea: read the sequence one step at a time, carrying a memory cell. Three gates decide what to forget, what to write, and what to output — letting information survive across many steps.

LSTM cell update (per time step)

# x = current input, h = previous hidden, c = previous cell
f = sigmoid(W_f·x + U_f·h + b_f)     # forget gate  (0..1)
i = sigmoid(W_i·x + U_i·h + b_i)     # input gate   (0..1)
o = sigmoid(W_o·x + U_o·h + b_o)     # output gate  (0..1)
g = tanh   (W_g·x + U_g·h + b_g)     # candidate memory

c = f * c  +  i * g                  # keep some old, add some new
h = o * tanh(c)                      # exposed output

Try it — watch the gates & memory

Feed a sequence of numbers (toy 1-unit LSTM, fixed weights). Step to see each gate open/close and the memory cell c and hidden state h evolve.
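A toy 1-unit LSTM of this kind can be sketched in a few lines of Python. The weights and the input sequence below are arbitrary values picked for illustration; the demo's own fixed weights are not shown on the page, so the numbers printed here will not match it.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h, c, p):
    """One step of a 1-unit LSTM; `p` maps each gate name to its (W, U, b) scalars."""
    f = sigmoid(p["f"][0] * x + p["f"][1] * h + p["f"][2])    # forget gate
    i = sigmoid(p["i"][0] * x + p["i"][1] * h + p["i"][2])    # input gate
    o = sigmoid(p["o"][0] * x + p["o"][1] * h + p["o"][2])    # output gate
    g = math.tanh(p["g"][0] * x + p["g"][1] * h + p["g"][2])  # candidate memory
    c = f * c + i * g        # keep some old memory, add some new
    h = o * math.tanh(c)     # exposed output
    return h, c, {"f": f, "i": i, "o": o, "g": g}


# Arbitrary fixed weights chosen for illustration (assumed, not the demo's values).
params = {"f": (0.5, 0.1, 1.0), "i": (1.0, 0.1, 0.0), "o": (0.8, 0.1, 0.0), "g": (1.5, 0.2, 0.0)}

h, c = 0.0, 0.0
history = []
for x in [1.0, 0.5, -1.0, 0.0]:
    h, c, gates = lstm_step(x, h, c, params)
    history.append((x, gates["f"], gates["i"], gates["o"], c, h))
    print(f"x={x:+.1f}  f={gates['f']:.2f}  i={gates['i']:.2f}  o={gates['o']:.2f}  c={c:+.3f}  h={h:+.3f}")
```

On the final input of 0.0 the candidate contributes little, so the value of `c` at that step is mostly what the forget gate let through from earlier steps.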

Interactive · LSTM gates. Panels: Sequence; Input sequence; Gate activations (this step); State

Context: Whole sequence, via memory

Strength: Variable length, long-ish range

Limit: Sequential, hard to parallelise

Transformer · attention

Key idea: let every token look at every other token directly and in parallel. Self-attention computes, for each token, a weighted blend of all tokens — the weights say "how much should I attend to you?"

Scaled dot-product self-attention

# each token gets Query, Key, Value vectors (learned projections)
Q = X · W_q,   K = X · W_k,   V = X · W_v

scores   = (Q · Kᵀ) / sqrt(d_k)      # d_k = key dimension; how well each query matches each key
weights  = softmax(scores, axis = keys)   # each row sums to 1
output   = weights · V               # blend of values

# done for all tokens at once → fully parallel, unlike an RNN
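The same computation, written as a dependency-free Python sketch for a single attention head. The three 2-D token embeddings are made up, and identity matrices stand in for the learned projections W_q, W_k, W_v, so this shows the arithmetic only, not trained behaviour.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over the token rows of X."""
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_k = len(K[0])                         # key dimension used for scaling
    K_T = [list(col) for col in zip(*K)]    # transpose of K
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    weights = [softmax(row) for row in scores]  # one distribution over keys per query
    return weights, matmul(weights, V)          # each output row is a blend of value rows


# Three tokens with made-up 2-D embeddings; identity projections are a placeholder assumption.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, output = self_attention(X, identity, identity, identity)

for row in weights:
    print([round(w, 3) for w in row])
```

Every row of `weights` is computed independently of the others, which is why the whole matrix can be evaluated in parallel.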

Try it — the attention matrix

Step reveals one query row at a time: its scaled dot-products softmaxed into attention weights over every key. Brighter cells = more attention.

Interactive · Self-attention. Panels: Tokens; Attention weights (rows = query, cols = key, %); Current query's distribution

Context: Every token, all at once

Strength: Parallel, long-range

Powers: GPT, BERT, T5, Llama …

Note: this demo uses fixed, deterministic embeddings with Q=K=V (no trained projections), so it shows the mechanism of attention, not learned behaviour. Real models learn W_q, W_k, W_v and stack many attention heads and layers.

Language Models, Step by Step
