Tokenisation is the process of splitting a stream of text into smaller pieces called tokens — words, subwords, characters or symbols. It's the first step in almost every NLP pipeline, because a model can't work with raw text directly: each token is mapped to a number before it ever reaches the network.
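As a rough illustration of the token-to-number step, the sketch below assigns each distinct token an integer id. The helper names `build_vocab` and `encode` and the `<unk>` entry are assumptions made for this example, not part of any particular library.

```python
# Toy example: map each token to an integer id (all names are illustrative).
def build_vocab(tokens):
    """Assign ids in order of first appearance; id 0 is reserved for unknowns."""
    vocab = {"<unk>": 0}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Look up each token, falling back to the <unk> id for unseen tokens."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]


tokens = "the cat sat on the mat".split()
vocab = build_vocab(tokens)
ids = encode(tokens, vocab)
print(ids)  # [1, 2, 3, 4, 1, 5]
```

The repeated word "the" gets the same id both times, and a token that was never seen (for example "dog") would map to 0.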
Which kind of token you produce depends on the method. The choice trades off vocabulary size against sequence length, and against how gracefully unseen words are handled. Modern large language models settle on subword tokenisation as the sweet spot.
Pick any method in the visualiser above to see it live. Each one in brief:
Byte Pair Encoding (BPE) starts from individual characters and repeatedly merges the most frequent adjacent pair into a new token. It balances vocabulary size against sequence length, and it falls back gracefully on rare words by splitting them into smaller known pieces. It is used by the GPT family.
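The merge loop can be sketched in a few lines. This is a minimal toy trainer, assuming whitespace-separated words and character-level starting symbols. The function names are made up for the example, and production tokenisers (such as the byte-level variant used by GPT models) add details that are left out here.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-separated corpus."""
    # Each word starts as a tuple of single characters.
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        # Ties are broken by whichever pair was counted first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


merges, words = train_bpe("low low low lower lowest", num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

After two merges the frequent word "low" is a single token, while "lower" and "lowest" are still spelled out as "low" plus leftover characters.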
WordPiece is a BERT-style subword method. It works like BPE, but it picks the merge that most improves the likelihood of the training data rather than the most frequent one. Continuation pieces are marked with ##.
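To make the ## marking concrete, here is a small sketch of the encoding side only: greedy longest-match-first lookup, as used by BERT-style tokenisers. It does not cover training. `toy_vocab` is a hypothetical hand-written vocabulary, where a real one would be learned from data.

```python
def wordpiece_encode(word, vocab, unk="[UNK]"):
    """Split one word into the longest matching pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark pieces that continue a word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical vocabulary, for illustration only.
toy_vocab = {"token", "##is", "##ation", "##s", "un", "##known"}

print(wordpiece_encode("tokenisation", toy_vocab))  # ['token', '##is', '##ation']
print(wordpiece_encode("xyz", toy_vocab))           # ['[UNK]']
```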
SentencePiece treats text as a raw character stream with no pre-tokenisation, encoding spaces as ▁. It is fully reversible and language-agnostic, which makes it a good fit for scripts that don't separate words with spaces. It is usually paired with a Unigram model.
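The reversibility comes from keeping spaces inside the token stream. The sketch below shows only that idea: single characters stand in for the pieces a trained model would produce, and the input is assumed not to contain ▁ already. It does not call the real SentencePiece library.

```python
WS = "\u2581"  # the "▁" character that stands in for a space


def to_pieces(text):
    """Escape spaces as ▁, then split into characters (stand-in for a trained model)."""
    return list(text.replace(" ", WS))


def from_pieces(pieces):
    """Join the pieces and turn ▁ back into spaces."""
    return "".join(pieces).replace(WS, " ")


text = "Hello world  again"  # note the double space
pieces = to_pieces(text)
restored = from_pieces(pieces)
print(restored == text)  # True
```

Because whitespace is never thrown away, even the double space survives the round trip, which a split-on-spaces tokeniser would lose.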
Word tokenisation splits text on spaces and punctuation so each word is its own token. It is intuitive and readable, but the vocabulary grows huge and any unseen word becomes an unknown token.
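A word-level splitter can be approximated with one regular expression. This is a simplified sketch: the pattern and the `<unk>` placeholder are choices made for the example, and real word tokenisers handle contractions, numbers and hyphens with more care.

```python
import re


def word_tokenise(text):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


print(word_tokenise("Hello, world! It works."))
# ['Hello', ',', 'world', '!', 'It', 'works', '.']

# With a fixed vocabulary, any word outside it collapses to one unknown token.
known = {"Hello", ",", "world", "!"}
mapped = [t if t in known else "<unk>" for t in word_tokenise("Hello, tokeniser!")]
print(mapped)  # ['Hello', ',', '<unk>', '!']
```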
Sentence tokenisation breaks text into sentences at ., ! and ?, while guarding against abbreviations. It is often a first pass before finer tokenisation.
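One way to guard against abbreviations is to check the word that ends at each candidate boundary. The sketch below assumes a small hand-picked `ABBREVIATIONS` set, which is illustrative rather than exhaustive. It will still mis-split abbreviations that are not listed.

```python
import re

# Illustrative list only; a real splitter needs a much larger one.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "e.g.", "i.e.", "etc."}


def split_sentences(text):
    """Split at . ! ? followed by whitespace or end of text, skipping abbreviations."""
    sentences, start = [], 0
    for m in re.finditer(r"[.!?]+(?=\s|$)", text):
        end = m.end()
        last_word = text[start:end].split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # the full stop belongs to an abbreviation, not a sentence end
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)  # keep trailing text with no final punctuation
    return sentences


print(split_sentences("Dr. Smith arrived. Was he late? No!"))
# ['Dr. Smith arrived.', 'Was he late?', 'No!']
```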
Character tokenisation turns every character into a token. The vocabulary is tiny and nothing is ever out-of-vocabulary, but sequences become very long.
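The trade-off is easy to see on a short string. In this small sketch the sample text is arbitrary, and splitting on whitespace is used only as a point of comparison.

```python
text = "tokenise me"

char_tokens = list(text)      # one token per character, spaces included
word_tokens = text.split()    # whitespace split, for comparison
char_vocab = sorted(set(char_tokens))

print(len(char_tokens), len(word_tokens))  # 11 2
print(len(char_vocab))                     # 9
```

Eleven character tokens cover what two word tokens would, but only nine distinct symbols are needed, and any new word built from those characters can still be encoded.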
Want the mechanics? The Algorithms page builds BPE, WordPiece and SentencePiece step by step, and Compare shows them side by side on the same input.