Five models, each with the idea, the pseudocode, and an interactive widget you can drive yourself. They build on each other — start at the top and watch what each one adds.
function bow_vectorise(docs):
    vocab = sorted(unique_words(docs))      # the columns
    vectors = []
    for doc in docs:
        v = [0] * len(vocab)
        for word in tokenise(doc):
            v[ vocab.index(word) ] += 1     # count occurrences
        vectors.append(v)
    return vocab, vectors

# similarity between two docs = cosine of their count vectors
sim(a, b) = dot(a, b) / ( ||a|| * ||b|| )
Edit either document. The shared vocabulary forms the columns; each doc becomes a row of counts. Cosine similarity measures overlap — notice it ignores order entirely.
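The pseudocode above can be sketched in runnable Python. This is a minimal sketch, not the widget's actual implementation: the regex tokeniser is an assumption, and a dictionary lookup stands in for `vocab.index(word)` to avoid a linear scan per word.

```python
import math
import re

def tokenise(doc):
    # Assumed tokeniser: lowercase, keep runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", doc.lower())

def bow_vectorise(docs):
    tokenised = [tokenise(d) for d in docs]
    vocab = sorted({w for toks in tokenised for w in toks})  # the columns
    index = {w: i for i, w in enumerate(vocab)}              # word -> column
    vectors = []
    for toks in tokenised:
        v = [0] * len(vocab)
        for w in toks:
            v[index[w]] += 1                                 # count occurrences
        vectors.append(v)
    return vocab, vectors

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0                                           # empty doc: define as 0
    return dot / (norm_a * norm_b)

vocab, (v1, v2) = bow_vectorise(["the dog bit the man", "the man bit the dog"])
print(vocab)  # ['bit', 'dog', 'man', 'the']
print(v1)     # [1, 1, 1, 2]
print(v2)     # [1, 1, 1, 2]
```

The two sentences mean different things, yet their count vectors are identical, so `cosine(v1, v2)` comes out at approximately 1.0. That is the order-blindness the widget demonstrates.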
function train_ngram(corpus, n):
    counts = {}                             # (context) -> {next_word: count}
    for sentence in corpus:
        toks = tokenise(sentence)
        for i in 0 .. len(toks) - n:
            context = toks[i : i+n-1]
            next = toks[i+n-1]
            counts[context][next] += 1
    return counts

# prediction (with simple back-off if context unseen)
P(next | context) = counts[context][next] / sum(counts[context])
Pick an order, then Step to slide the n-gram window across the sentence, accumulating counts. Those counts are the model.
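A possible Python rendering of the counting loop, plus one way the "simple back-off" could work. The back-off here is an assumption: it just drops the oldest context word until it finds a context it has seen, which is not a properly normalised scheme. The helper name `prob`, the `models` dictionary and the whitespace tokeniser are likewise illustrative.

```python
from collections import Counter, defaultdict

def train_ngram(corpus, n):
    """Map each (n-1)-token context tuple to a Counter of next words."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        toks = sentence.lower().split()          # assumed whitespace tokeniser
        for i in range(len(toks) - n + 1):       # every full n-gram window
            context = tuple(toks[i : i + n - 1])
            counts[context][toks[i + n - 1]] += 1
    return counts

def prob(models, context, word):
    """P(word | context), shortening the context until it has been seen.

    `models` maps an order n to the counts returned by train_ngram(corpus, n).
    """
    context = tuple(context)
    while True:
        counts = models[len(context) + 1]
        if context in counts:
            total = sum(counts[context].values())
            return counts[context][word] / total
        if not context:                          # nothing left to drop
            return 0.0
        context = context[1:]                    # back off: drop oldest word

corpus = ["the cat sat", "the cat ran", "the dog sat"]
models = {n: train_ngram(corpus, n) for n in (1, 2, 3)}

print(prob(models, ("the",), "cat"))       # 2 of the 3 words after "the"
print(prob(models, ("a", "cat"), "sat"))   # unseen trigram context, falls back to ("cat",)
```

The second call never saw the context `("a", "cat")`, so it falls back to the bigram counts for `("cat",)`, where "sat" and "ran" each occurred once.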
# 1. Slide a window; emit (center, context) pairs
for center_i in sentence:
    for j in [center_i - W .. center_i + W], j != center_i:
        emit( center=word[center_i], context=word[j] )

# 2. Learn vectors so the center predicts its context words
maximise P(context | center) = softmax( v_context · v_center )
# gradient ascent nudges co-occurring words' vectors together
Step through each center word; the window emits a training pair with every neighbour.
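Step 1 of the pseudocode, the pair-emitting window, might look like this in Python. It is a sketch of pair generation only; the training step is omitted, and clipping the window at the sentence edges is an assumption about how boundaries are handled.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                  # clip at the sentence start
        hi = min(len(tokens), i + window + 1)    # clip at the sentence end
        for j in range(lo, hi):
            if j != i:                           # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["the", "cat", "sat"], window=1)
print(pairs)
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```

Words at the edges emit fewer pairs than words in the middle, because part of their window falls outside the sentence.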
Once trained, vectors cluster by meaning. Click a word to see its nearest neighbours by cosine similarity (these vectors are illustrative, pre-placed in 2D).
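The nearest-neighbour lookup can be sketched as follows. The four 2D vectors are hypothetical and hand-placed for illustration, in the same spirit as the widget's; they are not trained embeddings and are not taken from the demo.

```python
import math

# Hypothetical, hand-placed 2D vectors (not trained, not from the widget).
vectors = {
    "king":  (0.9, 0.8),
    "queen": (0.8, 0.9),
    "apple": (-0.7, 0.6),
    "pear":  (-0.8, 0.5),
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def nearest(word, k=1):
    """Return the k other words whose vectors have the highest cosine similarity."""
    scored = [(cosine(vectors[word], v), w) for w, v in vectors.items() if w != word]
    return [w for _, w in sorted(scored, reverse=True)[:k]]

print(nearest("king"))   # ['queen']
print(nearest("apple"))  # ['pear']
```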
# x = current input, h = previous hidden, c = previous cell
f = sigmoid(W_f·x + U_f·h + b_f)    # forget gate (0..1)
i = sigmoid(W_i·x + U_i·h + b_i)    # input gate (0..1)
o = sigmoid(W_o·x + U_o·h + b_o)    # output gate (0..1)
g = tanh   (W_g·x + U_g·h + b_g)    # candidate memory
c = f * c + i * g                   # keep some old, add some new
h = o * tanh(c)                     # exposed output
Feed a sequence of numbers (toy 1-unit LSTM, fixed weights). Step to see each gate open/close and the memory cell c and hidden state h evolve.
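A one-unit LSTM step is small enough to write with plain floats. The sketch below follows the gate equations above. The weight values are hypothetical placeholders chosen for illustration and are not the ones the widget uses.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h, c, p):
    """One step of a 1-unit LSTM; x, h, c are scalars, p holds the weights."""
    f = sigmoid(p["W_f"] * x + p["U_f"] * h + p["b_f"])    # forget gate
    i = sigmoid(p["W_i"] * x + p["U_i"] * h + p["b_i"])    # input gate
    o = sigmoid(p["W_o"] * x + p["U_o"] * h + p["b_o"])    # output gate
    g = math.tanh(p["W_g"] * x + p["U_g"] * h + p["b_g"])  # candidate memory
    c_new = f * c + i * g                                  # keep some old, add some new
    h_new = o * math.tanh(c_new)                           # exposed output
    return h_new, c_new, {"f": f, "i": i, "o": o, "g": g}

# Hypothetical fixed weights: the same W, U, b for every gate.
params = {}
for gate in "fiog":
    params["W_" + gate] = 1.0
    params["U_" + gate] = 0.5
    params["b_" + gate] = 0.0

h, c = 0.0, 0.0
history = []
for x in [1.0, -1.0, 0.5]:
    h, c, gates = lstm_step(x, h, c, params)
    history.append((x, gates, c, h))
    print(f"x={x:+.1f}  f={gates['f']:.2f}  i={gates['i']:.2f}  "
          f"o={gates['o']:.2f}  c={c:+.3f}  h={h:+.3f}")
```

Because `h` is a sigmoid output multiplied by a tanh, it always stays strictly between -1 and 1, while `c` can accumulate beyond that range over many steps.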
# each token gets Query, Key, Value vectors (learned projections)
Q = X · W_q,  K = X · W_k,  V = X · W_v
scores  = (Q · Kᵀ) / sqrt(d)              # how well each query matches each key
weights = softmax(scores, axis = keys)    # each row sums to 1
output  = weights · V                     # blend of values
# done for all tokens at once → fully parallel, unlike an RNN
Step reveals one query row at a time: its scaled dot-products softmaxed into attention weights over every key. Brighter cells = more attention.
Note: this demo uses fixed, deterministic embeddings with Q=K=V (no trained projections), so it shows the mechanism of attention, not learned behaviour. Real models learn W_q, W_k, W_v and stack many attention heads and layers.
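The same simplification (Q = K = V, no learned projections) can be sketched in plain Python. The three input vectors are made up for illustration and are not the demo's embeddings; `d` is taken to be the vector dimension.

```python
import math

def softmax(row):
    m = max(row)                                  # subtract max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product attention with Q = K = V = X (no trained projections)."""
    d = len(X[0])
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d)
               for k_row in X] for q_row in X]
    weights = [softmax(row) for row in scores]    # each row sums to 1
    output = [[sum(w * v[col] for w, v in zip(w_row, X)) for col in range(d)]
              for w_row in weights]               # weighted blend of values
    return weights, output

# Made-up token vectors, one row per token.
X = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
weights, output = self_attention(X)
for row in weights:
    print([round(w, 3) for w in row])
```

The first token is orthogonal to the second, so its row puts the least weight there and splits the rest equally between itself and the third token. The third token overlaps with both of the others but matches itself best.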