Tokenisation Algorithms

This page is for students who want to implement tokenisers themselves. The visualiser shows what each method produces; here we cover how each one works — the data structures, the pseudocode, and a worked trace you can follow with pen and paper before writing a single line of code.

One idea to hold onto: subword tokenisers have two separate phases. Training learns a vocabulary (and sometimes merge rules) from a corpus — you do this once, offline. Encoding applies that fixed vocabulary to new text — this runs every time the model sees input. Mixing these two up is the most common bug when people implement BPE for the first time.

The tokenisation pipeline

Almost every tokeniser, no matter how fancy, is the same four stages in a row. When you implement one, keep these as separate functions so you can test each in isolation.

Stage overview

# 1. NORMALISE   raw text  -> cleaned text
#    lowercase, strip accents, NFKC unicode, collapse spaces (optional)
# 2. PRE-TOKENISE cleaned  -> chunks  (usually "words" + whitespace)
#    split on spaces/punctuation so merges never cross word boundaries
# 3. MODEL        chunks   -> tokens  (BPE / WordPiece / Unigram / ...)
#    the actual algorithm; this is the interesting part
# 4. POST-PROCESS tokens   -> ids     (look up each token in the vocab)
#    add special tokens like [CLS], [SEP], <s>, </s>

Stages 1, 2 and 4 are bookkeeping. The algorithm choices everyone argues about live entirely in stage 3, which is what the rest of this page is about. Note that pre-tokenisation matters: classic BPE and WordPiece split into words first so a merge can never span a space, whereas SentencePiece deliberately skips this step and treats the space as just another character (more on that below).

Baselines rule-based

Word, character and sentence tokenisation need no training — they are pure rules. They are worth implementing first because they give you the normalise/pre-tokenise scaffolding the subword methods reuse.

Character

The trivial case: every Unicode code point is a token. Tiny, fixed vocabulary; very long sequences.

Character tokenise

function char_tokenise(text):
    return list(text)        # split into code points

Word

Split on whitespace and (usually) peel punctuation off as its own token. The subtlety is what counts as a boundary — a regex makes the rules explicit and testable.

Word tokenise

function word_tokenise(text):
    # one or more letters/digits  OR  a single punctuation char
    return regex_findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)

Sentence

Break on . ! ? — but guard against abbreviations (Dr., e.g.) that contain a full stop without ending a sentence. A small block-list of abbreviations handles most English cases.

Sentence tokenise

function sentence_tokenise(text):
    sentences = []
    buffer = ""
    for word in split_keeping_spaces(text):
        buffer += word
        if ends_with(buffer, [".", "!", "?"])
           and last_word(buffer) not in ABBREVIATIONS:
            sentences.append(trim(buffer))
            buffer = ""
    if trim(buffer): sentences.append(trim(buffer))
    return sentences

Try it — live baseline tokeniser

These need no stepping — they're pure rules. Type and switch modes to see the boundaries move.

Interactive · baselines

Text

Tokens

Vocab size Char: tiny · Word: huge (every form) · Sentence: unbounded

Out-of-vocab Char: never · Word: frequent (any unseen word)

Training None — pure rules

Byte Pair Encoding subword

Key idea: start from individual characters and repeatedly merge the most frequent adjacent pair into a new symbol. Each merge becomes a permanent rule. The list of merges, in the order they were learned, is the trained model.

Training — learn the merges

Represent each word as a sequence of characters plus an end-of-word marker (here </w>) so the algorithm can tell er at the end of a word from er in the middle. Count word frequencies once; you never need the raw corpus again.

BPE training

function bpe_train(corpus, num_merges):
    # word -> frequency, each word as a tuple of symbols
    vocab = count_words(corpus)            # {("l","o","w","</w>"): 5, ...}
    merges = []                          # ordered list of learned rules

    repeat num_merges times:
        pairs = count_adjacent_pairs(vocab)   # {("e","s"): 9, ("s","t"): 7, ...}
        if pairs is empty: break
        best = argmax(pairs)               # most frequent pair
        vocab = merge_pair(best, vocab)     # replace every "e","s" with "es"
        merges.append(best)

    return merges                        # the trained model

function count_adjacent_pairs(vocab):
    counts = {}
    for word, freq in vocab:
        for (a, b) in consecutive_pairs(word):
            counts[(a, b)] += freq      # weight by word frequency!
    return counts

Encoding — apply the merges

To tokenise new text, split each word into characters and apply the learned merges in the order they were learned. That ordering is what makes BPE deterministic and reproducible.

BPE encode

function bpe_encode(word, merges):
    symbols = list(word) + ["</w>"]
    for (a, b) in merges:            # learned order is essential
        symbols = merge_pair_in((a, b), symbols)
    return symbols

# Faster variant: instead of looping over ALL merges, repeatedly find the
# adjacent pair in `symbols` with the lowest merge-rank, apply it, repeat
# until no pair is mergeable. Same result, scales to large merge tables.

Worked trace

Learning to compress the word lowest, given these were the top pairs found in the corpus:

start: l · o · w · e · s · t

merge e+s: l · o · w · es · t

merge l+o: lo · w · es · t

merge lo+w: low · es · t

merge es+t: low · est

merge low+est: lowest

A rare word like "lowestest" would simply stop merging earlier and come out as low + est + est — that graceful fallback to known fragments is exactly why subword models handle unseen words so well.

Try it — train BPE step by step

Edit the corpus, then press Step to apply one merge at a time. The middle panel shows the live adjacent-pair counts (the winner is highlighted — that's what gets merged next); watch the word table rebuild and the merge list grow.

Interactive · BPE trainer

Corpus

Words → current symbols

Adjacent-pair counts

Learned merges (in order)

Merge choice Most frequent adjacent pair

Training cost ~O(merges × pairs); speed up with a pair-count heap

Used by GPT-2/3/4 (byte-level), RoBERTa

Byte-level BPE (used by GPT models) runs this exact algorithm over the 256 raw UTF-8 bytes instead of Unicode characters. The starting alphabet is then always exactly 256 symbols, so nothing is ever truly out-of-vocabulary — any text, any language, any emoji, decomposes into bytes.

WordPiece subword

Key idea: almost identical to BPE during training, but instead of merging the most frequent pair it merges the pair that most increases the likelihood of the training data — roughly, the pair whose parts occur together far more than chance would predict.

Training — score, don't just count

For a candidate pair (a, b), BPE asks "how often does ab appear?". WordPiece instead asks "how much does gluing them help?" using the score below. This favours merges where the pair is much more common than its pieces would be independently, which tends to produce more linguistically meaningful units.

WordPiece merge score

# pick the pair maximising this ratio, not the raw count
score(a, b) = freq(a, b) / ( freq(a) × freq(b) )

function wordpiece_train(corpus, vocab_size):
    vocab = init_with_characters(corpus)   # continuations get "##" prefix
    while size(vocab) < vocab_size:
        pairs = count_adjacent_pairs(corpus, vocab)
        best  = argmax(pairs, key = score)   # likelihood, not frequency
        if best is none: break
        vocab.add(merge(best))
    return vocab                          # a set of pieces (no merge list)

Encoding — greedy longest match

This is where WordPiece differs most from BPE. There is no merge list to replay; instead you scan each word left-to-right and, at every position, take the longest piece in the vocabulary that fits. Continuation pieces carry a ## prefix. If no prefix matches, the whole word becomes [UNK].

WordPiece encode (greedy longest-match-first)

function wordpiece_encode(word, vocab):
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = none
        while start < end:                # shrink the window from the right
            sub = word[start:end]
            if start > 0: sub = "##" + sub  # mark continuation
            if sub in vocab:
                piece = sub; break
            end -= 1
        if piece is none:
            return ["[UNK]"]            # whole word unmatchable
        tokens.append(piece)
        start = end
    return tokens

Worked trace

Encoding playing against a vocab that contains play and ##ing:

window "playing" → "playin" → ... → "play" ✓ (longest match at start)

window "ing" → try "##ing" → "##ing" ✓

result: [ play , ##ing ]

Try it — greedy longest-match, step by step

Set a word and a vocabulary, then Step to shrink the search window one character at a time. At each start position WordPiece takes the longest vocab piece that fits; ## marks a continuation. (Single characters are always kept as a fallback, so you rarely hit [UNK].)

Interactive · WordPiece encoder

Word

Vocab

Word & search window

Vocabulary

+ every single character (a–z, ##a–##z) as fallback

Tokens so far

Merge choice Highest likelihood score

Encoding Greedy longest-match, left to right

Used by BERT, DistilBERT, ELECTRA

Unigram LM & SentencePiece subword

Key idea: BPE and WordPiece build up a vocabulary by merging. The Unigram model goes the other way: it starts huge and prunes down. It keeps a probability for every candidate piece and, for any word, picks the segmentation that is most probable overall.

First, a naming note. SentencePiece is a library/framework, not an algorithm. It can train either a BPE or a Unigram model. Its real contribution is treating the input as a raw stream of Unicode — it does no pre-tokenisation and encodes the space itself as a visible marker ▁ (U+2581). That means it is fully reversible and works on languages like Japanese or Chinese that don't separate words with spaces. The algorithm people usually pair it with is Unigram, described here.

Encoding — Viterbi over segmentations

Given a trained set of pieces each with a probability, a word can usually be split many ways. Unigram scores a segmentation as the product of its piece probabilities (sum of log-probs) and uses dynamic programming — the Viterbi algorithm — to find the single best split in linear time.

Unigram encode (Viterbi best segmentation)

function unigram_encode(word, logp):
    N = len(word)
    best = array(N + 1, fill = -infinity)   # best[i] = score of word[:i]
    back = array(N + 1, fill = -1)          # back-pointer for reconstruction
    best[0] = 0

    for i in 1 .. N:
        for j in 0 .. i-1:
            piece = word[j:i]
            if piece in logp:
                cand = best[j] + logp[piece]
                if cand > best[i]:
                    best[i] = cand
                    back[i] = j

    return walk_backpointers(back, word)    # reconstruct the pieces

Training — EM with pruning

Training learns those piece probabilities. Seed an over-complete vocabulary (e.g. all substrings up to some length, or a big BPE run), then alternate Expectation-Maximisation to re-estimate probabilities and prune the pieces that contribute least, until you hit the target size.

Unigram training (sketch)

function unigram_train(corpus, vocab_size, keep = 0.8):
    V = seed_large_vocab(corpus)        # way more pieces than needed
    p = init_probs(V)                   # e.g. from substring frequencies

    while size(V) > vocab_size:
        # E-step: best (or marginal) segmentation of each word under p
        counts = expected_piece_counts(corpus, p)
        # M-step: re-estimate each piece probability
        p = normalise(counts)
        # Prune: drop pieces whose removal costs the least likelihood
        loss = per_piece_loss(V, p, corpus)
        V = keep_top(V, fraction = keep, by = loss)
        # (always keep single characters so every word stays encodable)

    return p                            # piece -> probability

Worked trace

Segmenting ▁cats when the vocab scores ▁cat highly and s exists:

candidate A: ▁ · c · a · t · s → low total log-prob (many rare pieces)

candidate B: ▁cat · s → highest total log-prob ✓

Viterbi returns: [ ▁cat , s ]

Try it — Viterbi best segmentation, step by step

Each vocab entry has a log-probability (higher = more likely; values are negative). Step advances the end position; at each one the lattice records the best-scoring way to reach it. When it reaches the end, the back-pointers light up the single most probable split.

Interactive · Unigram / Viterbi

Word

Vocab

Lattice — best cumulative log-prob at each position

Candidates ending at current position

Best segmentation

Direction Top-down: prune a big vocab

Encoding Viterbi (most probable split)

Bonus Can sample alternative splits (subword regularisation)

Used by T5, ALBERT, XLNet, mBART, Llama (via SP)

Implementation tips

A few things that trip people up when they build these from scratch:

Weight pair counts by word frequency. In BPE/WordPiece training you count over the word-frequency table, not the raw token stream — forgetting the frequency weight gives wrong merges.
Keep the end-of-word marker. Without </w> (or the ▁ space marker) you can't distinguish word-final pieces, and decoding becomes ambiguous.
Preserve merge order for BPE. The trained model is the ordered list of merges; applying them in any other order produces different (wrong) tokens.
Always keep single characters / bytes in the vocab. This guarantees every input is encodable and avoids [UNK] entirely (byte-level BPE takes this to its logical conclusion).
Test encode∘decode round-trips. Tokenising then de-tokenising should return the original text (modulo the normalisation you chose). This single test catches most bugs.
Validate against a reference. Once it works, compare your output to Hugging Face tokenizers or Google's sentencepiece on the same corpus to confirm correctness.

Ready to see them in action? Head to the Visualiser to watch tokens form as you type, or the Compare page to see BPE, WordPiece and SentencePiece handle the same input side by side.

Tokenisation Algorithms

Jump to

The tokenisation pipeline

Baselines rule-based

Byte Pair Encoding subword

WordPiece subword

Unigram LM & SentencePiece subword

Implementation tips