Tokenization

LLM & Language Models

The process of splitting text into tokens — the fundamental preprocessing step before any AI language model can process your input.

Tokenization is how AI models convert human-readable text into numbers they can process. Different models use different tokenizers: GPT models use a byte-pair encoding (BPE) tokenizer, while others might use SentencePiece or WordPiece. The same text can produce different token counts depending on the tokenizer.
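To make the BPE idea concrete, here is a minimal, illustrative sketch of BPE training: start from individual characters and repeatedly merge the most frequent adjacent pair. This is a toy version for intuition only; production tokenizers operate on bytes, use large pre-trained merge tables, and handle ties and special tokens differently.

```python
from collections import Counter

def bpe_train(text, num_merges):
    """Toy BPE: repeatedly fuse the most frequent adjacent pair of tokens."""
    tokens = list(text)  # start from individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged, i = [], 0
        while i < len(tokens):
            # fuse every occurrence of the chosen pair into one token
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

tokens, merges = bpe_train("low lower lowest", 4)
print(tokens)  # shared stems like 'low' get fused into single tokens
```

After a few merges, frequent substrings ("low" here) become single tokens, which is exactly why common words cost one token while rare words split into several.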

Tokenization explains some quirky AI behaviors. Models struggle with character-level tasks (counting letters, reversing words) because they don't see individual characters — they see tokens. 'Strawberry' might be tokenized as 'str' + 'aw' + 'berry', making it hard for the model to count the letter 'r'.
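The fragment effect above can be sketched with a greedy longest-match tokenizer over a tiny hypothetical vocabulary (real GPT vocabularies hold tens of thousands of entries, and real BPE applies learned merges rather than pure longest-match):

```python
def greedy_tokenize(word, vocab):
    """Greedy longest-match tokenization, a stand-in for a real BPE vocab."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest piece first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # unknown character: emit it alone
            i += 1
    return tokens

# Hypothetical vocabulary fragment for illustration only.
vocab = {"str", "aw", "berry", "st", "raw"}
print(greedy_tokenize("strawberry", vocab))  # ['str', 'aw', 'berry']
```

The model receives three opaque token IDs, not ten characters, so "how many r's are in strawberry?" requires it to recall spelling rather than simply look at its input.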

For most users, tokenization is invisible. But if you're optimizing prompts for cost, debugging unexpected model behavior, or working with non-English text (which often tokenizes less efficiently), understanding tokenization helps.
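For quick cost ballparking, a common rule of thumb is roughly 4 characters per token for English prose. The sketch below uses that heuristic; the constant is an assumption, and actual counts vary by tokenizer and language, so use the model provider's own tokenizer for anything billing-sensitive.

```python
def rough_token_estimate(text):
    """Very rough heuristic: ~4 characters per token for English prose.
    Real counts vary by tokenizer and language; this is an estimate only."""
    return max(1, len(text) // 4)

prompt = "Summarize the following article in three bullet points."
print(rough_token_estimate(prompt))
```

Non-English and code-heavy text typically lands well above this estimate, which is one reason multilingual prompts can cost noticeably more than their English equivalents.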

Real-World Example

If an AI struggles to count letters in a word or has trouble with unusual spelling — tokenization is likely why. The model sees word fragments, not individual characters.

FAQ

What concepts are related to Tokenization?

Key related concepts include Token, LLM (Large Language Model), and Context Window. Understanding these together gives a more complete picture of how tokenization fits into the AI landscape.