This page is for students who want to implement tokenisers themselves. The visualiser shows what each method produces; here we cover how each one works — the data structures, the pseudocode, and a worked trace you can follow with pen and paper before writing a single line of code.
Almost every tokeniser, no matter how fancy, is the same four stages in a row. When you implement one, keep these as separate functions so you can test each in isolation.
# 1. NORMALISE raw text -> cleaned text # lowercase, strip accents, NFKC unicode, collapse spaces (optional) # 2. PRE-TOKENISE cleaned -> chunks (usually "words" + whitespace) # split on spaces/punctuation so merges never cross word boundaries # 3. MODEL chunks -> tokens (BPE / WordPiece / Unigram / ...) # the actual algorithm; this is the interesting part # 4. POST-PROCESS tokens -> ids (look up each token in the vocab) # add special tokens like [CLS], [SEP], <s>, </s>
Stages 1, 2 and 4 are bookkeeping. The algorithm choices everyone argues about live entirely in stage 3, which is what the rest of this page is about. Note that pre-tokenisation matters: classic BPE and WordPiece split into words first so a merge can never span a space, whereas SentencePiece deliberately skips this step and treats the space as just another character (more on that below).
Word, character and sentence tokenisation need no training — they are pure rules. They are worth implementing first because they give you the normalise/pre-tokenise scaffolding the subword methods reuse.
The trivial case: every Unicode code point is a token. Tiny, fixed vocabulary; very long sequences.
function char_tokenise(text): return list(text) # split into code points
Split on whitespace and (usually) peel punctuation off as its own token. The subtlety is what counts as a boundary — a regex makes the rules explicit and testable.
function word_tokenise(text): # one or more letters/digits OR a single punctuation char return regex_findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)
Break on . ! ? — but guard against abbreviations (Dr.,
e.g.) that contain a full stop without ending a sentence. A small block-list of abbreviations
handles most English cases.
function sentence_tokenise(text): sentences = [] buffer = "" for word in split_keeping_spaces(text): buffer += word if ends_with(buffer, [".", "!", "?"]) and last_word(buffer) not in ABBREVIATIONS: sentences.append(trim(buffer)) buffer = "" if trim(buffer): sentences.append(trim(buffer)) return sentences
These need no stepping — they're pure rules. Type and switch modes to see the boundaries move.
Represent each word as a sequence of characters plus an end-of-word marker (here </w>) so the
algorithm can tell er at the end of a word from er in the middle. Count word
frequencies once; you never need the raw corpus again.
function bpe_train(corpus, num_merges): # word -> frequency, each word as a tuple of symbols vocab = count_words(corpus) # {("l","o","w","</w>"): 5, ...} merges = [] # ordered list of learned rules repeat num_merges times: pairs = count_adjacent_pairs(vocab) # {("e","s"): 9, ("s","t"): 7, ...} if pairs is empty: break best = argmax(pairs) # most frequent pair vocab = merge_pair(best, vocab) # replace every "e","s" with "es" merges.append(best) return merges # the trained model function count_adjacent_pairs(vocab): counts = {} for word, freq in vocab: for (a, b) in consecutive_pairs(word): counts[(a, b)] += freq # weight by word frequency! return counts
To tokenise new text, split each word into characters and apply the learned merges in the order they were learned. That ordering is what makes BPE deterministic and reproducible.
function bpe_encode(word, merges): symbols = list(word) + ["</w>"] for (a, b) in merges: # learned order is essential symbols = merge_pair_in((a, b), symbols) return symbols # Faster variant: instead of looping over ALL merges, repeatedly find the # adjacent pair in `symbols` with the lowest merge-rank, apply it, repeat # until no pair is mergeable. Same result, scales to large merge tables.
Learning to compress the word lowest, given these were the top pairs found in the corpus:
A rare word like "lowestest" would simply stop merging earlier and come out as
low + est + est — that graceful fallback to known fragments is exactly why subword models handle
unseen words so well.
Edit the corpus, then press Step to apply one merge at a time. The middle panel shows the live adjacent-pair counts (the winner is highlighted — that's what gets merged next); watch the word table rebuild and the merge list grow.
O(merges × pairs); speed up with a pair-count heap
Byte-level BPE (used by GPT models) runs this exact algorithm over the 256 raw UTF-8 bytes instead of Unicode characters. The starting alphabet is then always exactly 256 symbols, so nothing is ever truly out-of-vocabulary — any text, any language, any emoji, decomposes into bytes.
For a candidate pair (a, b), BPE asks "how often does ab appear?". WordPiece instead
asks "how much does gluing them help?" using the score below. This favours merges where the pair is much more
common than its pieces would be independently, which tends to produce more linguistically meaningful units.
# pick the pair maximising this ratio, not the raw count score(a, b) = freq(a, b) / ( freq(a) × freq(b) ) function wordpiece_train(corpus, vocab_size): vocab = init_with_characters(corpus) # continuations get "##" prefix while size(vocab) < vocab_size: pairs = count_adjacent_pairs(corpus, vocab) best = argmax(pairs, key = score) # likelihood, not frequency if best is none: break vocab.add(merge(best)) return vocab # a set of pieces (no merge list)
This is where WordPiece differs most from BPE. There is no merge list to replay; instead you scan each word
left-to-right and, at every position, take the longest piece in the vocabulary that fits.
Continuation pieces carry a ## prefix. If no prefix matches, the whole word becomes
[UNK].
function wordpiece_encode(word, vocab): tokens = [] start = 0 while start < len(word): end = len(word) piece = none while start < end: # shrink the window from the right sub = word[start:end] if start > 0: sub = "##" + sub # mark continuation if sub in vocab: piece = sub; break end -= 1 if piece is none: return ["[UNK]"] # whole word unmatchable tokens.append(piece) start = end return tokens
Encoding playing against a vocab that contains play and ##ing:
Set a word and a vocabulary, then Step to shrink the search window one character at a time.
At each start position WordPiece takes the longest vocab piece that fits; ## marks a
continuation. (Single characters are always kept as a fallback, so you rarely hit [UNK].)
First, a naming note. SentencePiece is a library/framework, not an algorithm. It can
train either a BPE or a Unigram model. Its real contribution is treating the input as a raw stream of Unicode —
it does no pre-tokenisation and encodes the space itself as a visible marker ▁
(U+2581). That means it is fully reversible and works on languages like Japanese or Chinese that don't separate
words with spaces. The algorithm people usually pair it with is Unigram, described here.
Given a trained set of pieces each with a probability, a word can usually be split many ways. Unigram scores a segmentation as the product of its piece probabilities (sum of log-probs) and uses dynamic programming — the Viterbi algorithm — to find the single best split in linear time.
function unigram_encode(word, logp): N = len(word) best = array(N + 1, fill = -infinity) # best[i] = score of word[:i] back = array(N + 1, fill = -1) # back-pointer for reconstruction best[0] = 0 for i in 1 .. N: for j in 0 .. i-1: piece = word[j:i] if piece in logp: cand = best[j] + logp[piece] if cand > best[i]: best[i] = cand back[i] = j return walk_backpointers(back, word) # reconstruct the pieces
Training learns those piece probabilities. Seed an over-complete vocabulary (e.g. all substrings up to some length, or a big BPE run), then alternate Expectation-Maximisation to re-estimate probabilities and prune the pieces that contribute least, until you hit the target size.
function unigram_train(corpus, vocab_size, keep = 0.8): V = seed_large_vocab(corpus) # way more pieces than needed p = init_probs(V) # e.g. from substring frequencies while size(V) > vocab_size: # E-step: best (or marginal) segmentation of each word under p counts = expected_piece_counts(corpus, p) # M-step: re-estimate each piece probability p = normalise(counts) # Prune: drop pieces whose removal costs the least likelihood loss = per_piece_loss(V, p, corpus) V = keep_top(V, fraction = keep, by = loss) # (always keep single characters so every word stays encodable) return p # piece -> probability
Segmenting ▁cats when the vocab scores ▁cat highly and s exists:
Each vocab entry has a log-probability (higher = more likely; values are negative). Step advances the end position; at each one the lattice records the best-scoring way to reach it. When it reaches the end, the back-pointers light up the single most probable split.
A few things that trip people up when they build these from scratch:
</w> (or the ▁ space
marker) you can't distinguish word-final pieces, and decoding becomes ambiguous.[UNK] entirely (byte-level BPE takes this to its logical conclusion).tokenizers or Google's sentencepiece on the same corpus to confirm correctness.Ready to see them in action? Head to the Visualiser to watch tokens form as you type, or the Compare page to see BPE, WordPiece and SentencePiece handle the same input side by side.