What is Tokenisation?

Tokenisation is the process of splitting a stream of text into smaller pieces called tokens, which may be words, subwords, characters or symbols. It is the first step in almost every NLP pipeline, because a model cannot operate on raw text directly: each token is mapped to an integer ID before it reaches the network.
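To make the token-to-number step concrete, here is a minimal sketch. It assumes plain whitespace splitting and assigns IDs in order of first appearance; the helper names build_vocab and encode are illustrative, not taken from any library.

```python
def build_vocab(tokens):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Replace each token with its integer ID."""
    return [vocab[tok] for tok in tokens]


# Assumption: whitespace splitting stands in for a real tokeniser here.
tokens = "the cat sat on the mat".split()
vocab = build_vocab(tokens)
ids = encode(tokens, vocab)
print(ids)  # [0, 1, 2, 3, 0, 4]
```

Both occurrences of "the" receive the same ID, which is the property the network relies on.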

Which kind of token you produce depends on the method. The choice trades vocabulary size against sequence length, and it determines how gracefully unseen words are handled. Most modern large language models use subword tokenisation as the compromise between the two.

Types of Tokenisation

Pick any method in the visualiser above to see it live. Each one in brief:

BPE (subword)

Byte Pair Encoding starts from individual characters (or bytes, in byte-level variants) and repeatedly merges the most frequent adjacent pair into a new token. It balances vocabulary size against sequence length, and rare words break down into smaller known pieces instead of becoming unknown. Used by the GPT family.
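The merge loop can be sketched in a few lines. This is a toy character-level trainer under simplifying assumptions (whitespace-split words, no end-of-word marker, ties broken by first occurrence); the function names are illustrative, and production tokenisers add byte-level handling and much faster data structures.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words, pair):
    """Rewrite every word, fusing each occurrence of `pair` into one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-separated corpus."""
    # Each word starts as a tuple of single characters.
    words = {tuple(w): f for w, f in Counter(corpus.split()).items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties resolve to the first pair seen
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


merges, words = train_bpe("low low low lower lowest", num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

After two merges the frequent word "low" is a single token, while "lower" and "lowest" are "low" followed by leftover characters.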

WordPiece (subword)

The subword method used by BERT. It builds its vocabulary by merging, like BPE, but at each step it picks the merge that most improves the likelihood of the training data rather than the most frequent pair. Pieces that continue a word are prefixed with ##.
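The ## convention is easiest to see at encoding time. The sketch below covers only that side, not the likelihood-based training step: it assumes greedy longest-match-first lookup against an already-trained vocabulary, and both the toy vocabulary and the function name wordpiece_encode are made up for illustration.

```python
def wordpiece_encode(word, vocab, unk="[UNK]"):
    """Split one word into pieces by greedy longest-match-first lookup."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark pieces that continue a word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical vocabulary; a real one holds tens of thousands of entries.
vocab = {"token", "##is", "##ation", "##s", "un", "##able"}
print(wordpiece_encode("tokenisation", vocab))  # ['token', '##is', '##ation']
print(wordpiece_encode("xyz", vocab))           # ['[UNK]']
```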

SentencePiece (subword)

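Treats text as a raw character stream with no pre-tokenisation and encodes spaces as the visible symbol ▁. Because whitespace is kept as an ordinary symbol, the original text can be reconstructed from the tokens, and the method is language-agnostic, which suits scripts that do not separate words with spaces. It is usually paired with a Unigram language model, though it also supports BPE.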
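The reversibility comes from keeping whitespace as an ordinary symbol. The sketch below illustrates only that idea and does not call the sentencepiece library: it swaps spaces for ▁ and splits into single characters, where a trained model would emit larger pieces. The helper names are illustrative, and the sketch assumes the input does not already contain ▁.

```python
SPACE = "\u2581"  # the "▁" symbol used to stand in for a space


def to_pieces(text):
    """Replace spaces with the marker, then split into single characters.

    Assumption: a trained model would merge these into longer pieces;
    single characters keep the sketch self-contained.
    """
    return list(text.replace(" ", SPACE))


def from_pieces(pieces):
    """Concatenate the pieces and turn the markers back into spaces."""
    return "".join(pieces).replace(SPACE, " ")


text = "Hello  world"  # two spaces on purpose
pieces = to_pieces(text)
print(pieces)
print(from_pieces(pieces) == text)  # True
```

Because nothing is thrown away when the text is split, even the doubled space survives the round trip.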

Word (word-level)

Splits text on spaces and punctuation so each word is its own token. Intuitive and readable, but the vocabulary grows very large, and any word not seen during training is mapped to a single unknown token.
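A minimal sketch of the word-level approach and its unknown-word problem. The regular expression, the tiny vocabulary and the [UNK] name are illustrative assumptions rather than a standard.

```python
import re


def word_tokenize(text):
    """Return runs of word characters and individual punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def encode(tokens, vocab):
    """Look up each lower-cased token, falling back to the unknown ID."""
    return [vocab.get(tok.lower(), vocab["[UNK]"]) for tok in tokens]


# Hypothetical vocabulary; "purring" and "," are deliberately missing.
vocab = {"[UNK]": 0, "the": 1, "cat": 2, "sat": 3, ".": 4}

tokens = word_tokenize("The cat sat, purring.")
print(tokens)                 # ['The', 'cat', 'sat', ',', 'purring', '.']
print(encode(tokens, vocab))  # [1, 2, 3, 0, 0, 4]
```

The comma and "purring" collapse to the same ID, so the model cannot tell them apart.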

Sentence (sentence-level)

Breaks text into sentences at ., ! and ?, while guarding against abbreviations such as "Dr." that contain a full stop but do not end a sentence. Often used as a first pass before finer tokenisation.
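One simple way to implement the abbreviation guard is a lookup list, sketched below. The ABBREVIATIONS set is a small hypothetical sample, and real sentence splitters rely on far larger lists or trained models.

```python
import re

# Hypothetical, deliberately short list of abbreviations to skip.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "e.g.", "i.e.", "etc."}


def split_sentences(text):
    """Split at . ! ? followed by whitespace or end of text."""
    sentences, start = [], 0
    for m in re.finditer(r"[.!?]+(?=\s+|$)", text):
        end = m.end()
        last_word = text[start:end].split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # the full stop belongs to an abbreviation
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)  # trailing text without end punctuation
    return sentences


print(split_sentences("Dr. Smith arrived. Was he late? No!"))
# ['Dr. Smith arrived.', 'Was he late?', 'No!']
```

A list-based guard has an obvious weakness: a sentence that genuinely ends in "etc." will not be split there.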

Character (character-level)

Every character becomes a token. The vocabulary is tiny and nothing is ever out-of-vocabulary, but sequences become very long.
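The trade-off is easy to measure on a short string. The sketch below compares character-level and whitespace word-level splitting of the same sentence; the sentence and variable names are chosen purely for illustration.

```python
text = "the cat sat on the mat"

# Character-level: every character, spaces included, is a token.
char_tokens = list(text)
char_vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}

# Word-level, for comparison (assumes plain whitespace splitting).
word_tokens = text.split()
word_vocab = set(word_tokens)

print(len(char_tokens), len(char_vocab))  # 22 10
print(len(word_tokens), len(word_vocab))  # 6 5
```

Here the character vocabulary is larger only because the sample is tiny. On a real corpus it stays at the size of the alphabet while the word vocabulary keeps growing, and the character sequence is always several times longer.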

Want the mechanics? The Algorithms page builds BPE, WordPiece and SentencePiece step by step, and Compare shows them side by side on the same input.
