Tokenization: Definition and Legal AI

Tokenization

Tokenization is the process of breaking text down into "tokens", the basic units processed by an AI model. A token can be a whole word, part of a word, or a punctuation mark. Each model has a maximum token limit, its context window (128K tokens for GPT-4o, 200K for Claude), which determines how much text it can process at once.

Tokenization is the foundational step by which human text is converted into numerical units that an AI model can process. A token does not always correspond to a single word: depending on the tokenizer, the term "jurisprudence" might be split into "juris" + "prudence", that is, two tokens. On average, one token represents roughly 3/4 of a word in English. This granularity lets the model handle a virtually unlimited vocabulary from a finite set of tokens, because a rare or unseen word can always be assembled from smaller known pieces.
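
To make the idea of building words from a finite set of pieces concrete, here is a minimal sketch of a greedy longest-match subword tokenizer. The vocabulary, the IDs, and the function names tokenize and encode are hypothetical and chosen for illustration; production tokenizers learn much larger vocabularies from data and may split the same words differently.

```python
# Toy vocabulary (hypothetical): subword pieces mapped to integer IDs.
# Real tokenizers learn tens of thousands of pieces from training data.
VOCAB = {"juris": 0, "prudence": 1, "diction": 2, "im": 3}

# Single letters act as a fallback so any lowercase word can be encoded.
for ch in "abcdefghijklmnopqrstuvwxyz":
    VOCAB.setdefault(ch, len(VOCAB))


def tokenize(word: str) -> list[str]:
    """Split a word into the longest known pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, then shrink it one character at a time.
        for end in range(len(word), start, -1):
            if word[start:end] in VOCAB:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No piece matched, not even a single character.
            raise ValueError(f"cannot encode character {word[start]!r}")
    return pieces


def encode(word: str) -> list[int]:
    """Convert a word into the numerical IDs a model would receive."""
    return [VOCAB[piece] for piece in tokenize(word)]


print(tokenize("jurisprudence"))  # ['juris', 'prudence']
print(encode("jurisprudence"))    # [0, 1]
print(tokenize("law"))            # ['l', 'a', 'w']
```

With only 30 entries, this toy vocabulary can still encode any lowercase word: familiar legal terms cost few tokens, while words it has never seen fall back to one token per letter.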

For legal professionals, tokenization matters because the context window, meaning the maximum amount of text the model can process in a single request, is measured in tokens rather than in words or pages. GPT-4o offers 128,000 tokens (about 96,000 words), while Claude goes up to 200,000 tokens (about 150,000 words). It is this capacity that makes it possible to analyze lengthy contracts or voluminous court decisions in a single pass.
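
The word figures above follow from the 3/4-word-per-token rule of thumb, and the same arithmetic gives a rough check of whether a document fits in a given context window. The sketch below is an approximation only: the function names and the 4,000-token reserve for the model's answer are assumptions made for illustration, and an exact count requires the model's own tokenizer.

```python
import math

WORDS_PER_TOKEN = 0.75  # rule of thumb for English text

# Context window sizes quoted in the text, in tokens.
CONTEXT_WINDOWS = {"GPT-4o": 128_000, "Claude": 200_000}


def estimate_tokens(text: str) -> int:
    """Estimate the token count from the number of whitespace-separated words."""
    word_count = len(text.split())
    return math.ceil(word_count / WORDS_PER_TOKEN)


def fits_in_context(text: str, model: str, reserved_for_output: int = 4_000) -> bool:
    """Check whether the text plus a reserve for the answer fits in the window.

    The default reserve is an arbitrary assumption, not a vendor figure.
    """
    return estimate_tokens(text) + reserved_for_output <= CONTEXT_WINDOWS[model]


# A hypothetical 120,000-word contract bundle, simulated with a repeated word.
bundle = "clause " * 120_000
print(estimate_tokens(bundle))            # 160000
print(fits_in_context(bundle, "GPT-4o"))  # False
print(fits_in_context(bundle, "Claude"))  # True
```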

Understanding tokenization also helps optimize costs: AI APIs typically bill by usage, counting both the tokens sent to the model (input) and the tokens it generates (output). A well-structured prompt that avoids repetition and gets straight to the point will be not only more effective but also less expensive. Legaltech tools generally handle this optimization transparently for the end user.
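
The effect of prompt length on the bill can be worked out with simple arithmetic. In the sketch below, the function name request_cost and the per-million-token prices are placeholders chosen for illustration, not actual vendor rates; real prices vary by provider and model.

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Return the cost of one request when input and output are billed per token."""
    total = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    )
    return total / 1_000_000


# Placeholder prices per million tokens (assumptions, not real rates).
INPUT_PRICE = 2.50
OUTPUT_PRICE = 10.00

# The same question asked with a verbose prompt and with a concise one.
verbose = request_cost(50_000, 2_000, INPUT_PRICE, OUTPUT_PRICE)
concise = request_cost(30_000, 2_000, INPUT_PRICE, OUTPUT_PRICE)

print(f"verbose prompt: {verbose:.3f}")  # verbose prompt: 0.145
print(f"concise prompt: {concise:.3f}")  # concise prompt: 0.095
```

Under these placeholder prices, trimming 20,000 input tokens cuts the cost of the request by about a third, a saving that compounds when the same prompt runs across hundreds of documents.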
