Glossary · GEO

Tokenization

Tokenization is the process by which a large language model (LLM) breaks text into elementary units called tokens before processing it. A token is not exactly a word: it can be a short word, a word fragment, a punctuation mark or a sequence of characters. English text averages roughly 1.3 tokens per word, while languages with longer or less common words tend to be split into more pieces. Tokenization determines the cost of an API call (billed per token), the maximum amount of text a model can ingest at once, and the way the model segments content before interpreting it. For generative engine optimization (GEO), understanding tokenization helps you structure dense, self-contained passages that a model can isolate, vectorize and cite. Clear content, segmented into factual sentences, is easier for a generative AI to tokenize and reuse.

Tokenization is the invisible yet decisive step that precedes any processing by a language model. Before it can understand, summarize or cite your content, an AI turns it into a sequence of tokens. Mastering this mechanism lets you write content that models ingest and reuse without friction.

How it works

A model does not read words; it reads numerical identifiers. The tokenizer applies an algorithm, most often Byte Pair Encoding (BPE), that starts from small units such as characters or bytes and repeatedly merges the most frequent adjacent pairs into larger tokens. Each token in the resulting vocabulary maps to an integer ID. Common words usually become a single token, while rare words, proper nouns or technical terms are split into several pieces. The sequence of token IDs is then converted into vectors (see embedding) that the model manipulates.
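To make the merge idea concrete, here is a minimal sketch of BPE training on a toy corpus. It is a simplified illustration, not the tokenizer of any particular model: production tokenizers typically work on bytes, handle spaces and use vocabularies of tens of thousands of tokens. The corpus, the function names (`learn_bpe`, `merge_pair`, `tokenize`) and the ID assignment are assumptions made up for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of training words."""
    # Each distinct word starts as a tuple of single characters, with its count.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # The most frequent pair becomes a new token.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def tokenize(word, merges):
    """Split a word into tokens by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus: "low" and the suffix "est" are frequent.
corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(corpus, num_merges=4)

print(tokenize("low", merges))     # ['low']
print(tokenize("lowest", merges))  # ['low', 'est']

# Assign an integer ID to every known symbol: single characters, then merges.
known_symbols = sorted({ch for word in corpus for ch in word}) + [a + b for a, b in merges]
token_ids = {symbol: i for i, symbol in enumerate(known_symbols)}
print([token_ids[t] for t in tokenize("lowest", merges)])
```

The frequent word "low" ends up as a single token, while "lowest", which never appears in the corpus, is assembled from two learned pieces. The same mechanism explains why rare terms cost more tokens than common ones.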

Why it matters

Tokenization has three direct consequences. First, cost: LLM APIs bill per token, for both input and output. Second, the context window: every model has a maximum number of tokens it can handle at once, which limits how much text it can process in a single request. Third, comprehension: a poorly structured text, overloaded with jargon or unusual characters, is split into many small, irregular fragments and becomes harder to segment cleanly.
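The first two consequences reduce to simple arithmetic. The sketch below estimates a token count from the rough 1.3 tokens-per-word average for English, then derives a cost and a context-window check. The prices and the window size are hypothetical placeholders, not the rates of any real provider, and an exact count requires the model's own tokenizer.

```python
TOKENS_PER_WORD = 1.3  # rough English average; varies by tokenizer and language


def estimate_tokens(text, tokens_per_word=TOKENS_PER_WORD):
    """Approximate the token count of a text from its word count."""
    return round(len(text.split()) * tokens_per_word)


def estimate_cost(input_tokens, output_tokens, input_price, output_price):
    """Cost of one call, with prices expressed per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def fits_in_context(input_tokens, max_output_tokens, context_window):
    """Input and output tokens share the same context window."""
    return input_tokens + max_output_tokens <= context_window


page = "Tokenization breaks a text into tokens before a model processes it."
n_input = estimate_tokens(page)

# Hypothetical prices (per million tokens) and window size, for illustration only.
cost = estimate_cost(n_input, output_tokens=200, input_price=2.0, output_price=8.0)
ok = fits_in_context(n_input, max_output_tokens=200, context_window=8_000)
print(n_input, cost, ok)
```

Because output tokens are often billed at a different rate than input tokens, the two counts are kept separate rather than summed.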

Key takeaway

Writing for AI means writing for tokenization: short sentences, self-contained facts and clear vocabulary split into fewer, more regular tokens and are easier for a model to isolate and cite.

A concrete example

The word "optimization" may be split into two or three tokens depending on the tokenizer, whereas a short, common acronym such as "SEO" typically takes only one. This is no trivial detail: a page dense with rare terms consumes more tokens and offers fewer clean segmentation boundaries. By structuring your content into factual, self-sufficient passages, you make both chunking and citation extraction easier for generative engines.
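As an illustration of why self-sufficient sentences help, here is a minimal sketch of how a retrieval pipeline might chunk a page: it cuts only at sentence boundaries and closes a passage once a token budget is reached. The sentence splitter is deliberately naive, the token count reuses the 1.3 tokens-per-word approximation, and the function names and budget are assumptions for this example rather than the behavior of any specific engine.

```python
import re


def split_sentences(text):
    """Naive splitter: cut after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_by_token_budget(text, max_tokens, tokens_per_word=1.3):
    """Group whole sentences into passages that stay within `max_tokens`."""
    chunks, current, current_tokens = [], [], 0
    for sentence in split_sentences(text):
        n = round(len(sentence.split()) * tokens_per_word)
        # Close the current passage if this sentence would exceed the budget.
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = (
    "Tokens are not words. "
    "A tokenizer splits text into pieces. "
    "Models bill per token."
)
for passage in chunk_by_token_budget(text, max_tokens=13):
    print(passage)
```

Each passage ends on a complete sentence, so a fact that fits in one sentence is never cut in half. A sentence that depends on the previous paragraph for its meaning, by contrast, can land in a different chunk from its context.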

Frequently asked questions

A token is not exactly a word. It is a unit of segmentation that can be a whole word, a word fragment, a space or a punctuation mark. In English a word averages around 1.3 tokens, with long or rare words split into several pieces.

Tokenization matters for GEO because it determines how a model segments and prices your content. Short, factual, self-contained passages tokenize cleanly and are easier for a generative AI to isolate and then cite.
