Next-word predictions for the same context from unigram, bigram, and trigram models (built-in cat/dog corpus).
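
For readers who want to reproduce the comparison offline, here is a minimal sketch of count-based n-gram prediction. The three-sentence `CORPUS` and the helper names `train_ngram` and `predict` are illustrative assumptions, not the built-in corpus or code behind the demo, and the sketch uses plain maximum-likelihood counts with no smoothing or sentence-boundary padding.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus (an assumption; not the demo's built-in corpus).
CORPUS = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]


def train_ngram(sentences, n):
    """Count how often each word follows every (n-1)-word history."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(n - 1, len(tokens)):
            history = tuple(tokens[i - n + 1:i])  # empty tuple when n == 1
            counts[history][tokens[i]] += 1
    return counts


def predict(counts, context, n):
    """Return next-word probabilities using only the last n-1 context words."""
    history = tuple(context.split()[-(n - 1):]) if n > 1 else ()
    followers = counts.get(history, Counter())
    total = sum(followers.values())
    # An unseen history yields an empty distribution (no smoothing here).
    return {word: c / total for word, c in followers.items()} if total else {}


context = "the cat sat on the"
for n in (1, 2, 3):
    model = train_ngram(CORPUS, n)
    print(f"{n}-gram:", predict(model, context, n))
```

On this toy corpus the unigram model ignores the context entirely, the bigram model conditions only on "the", and the trigram model conditions on "on the", so the distribution narrows as n grows. A history that never appeared in training gets no prediction at all, which is the sparsity problem in miniature.
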
How each model represents context, what it captures, and where it falls short.
| Model | Represents context as | Word order? | Context length | Output | Key limitation |
|---|---|---|---|---|---|
| Bag-of-Words | Count vector over vocab | No | Whole doc (unordered) | Sparse vector | Loses all order & meaning |
| N-grams | Counts of word sequences | Local | Previous n−1 words | Next-word probabilities | Sparse; tiny context |
| Word2Vec | Dense learned vectors | No* | Training window only | One vector per word | Context-free word meaning |
| LSTM | Recurrent hidden + cell state | Yes | Whole sequence (decays) | Contextual states / next word | Sequential; long-range fade |
| Transformer | Attention over all tokens | Yes | Whole sequence, direct | Contextual vectors / next token | Compute grows with length² |
* Word2Vec learns from which words co-occur within its training window (proximity rather than exact order) and produces a single order-independent vector per word.
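
The "Word order?" column for the first two rows can be checked with a short sketch. The two example sentences and the helper names `bag_of_words` and `bigrams` are assumptions chosen for illustration: the sentences contain the same words in a different order, so their bag-of-words vectors are identical while their bigram counts differ.

```python
from collections import Counter


def bag_of_words(text, vocab):
    """Return a count vector over vocab, ignoring word order."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]


def bigrams(text):
    """Return the counts of adjacent word pairs in the text."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


# Hypothetical example sentences: same words, opposite meaning.
a = "the dog chased the cat"
b = "the cat chased the dog"

vocab = sorted(set(a.split()) | set(b.split()))
vec_a = bag_of_words(a, vocab)
vec_b = bag_of_words(b, vocab)

print(vocab)                      # shared vocabulary, alphabetical
print(vec_a, vec_b)               # identical count vectors
print(vec_a == vec_b)             # order is invisible to bag-of-words
print(bigrams(a) == bigrams(b))   # local order is visible to bigrams
```

The bigram counts differ only because of pairs such as ("dog", "chased") versus ("cat", "chased"), which is what "Local" word order means in the table: order is captured, but only within n adjacent words.
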