Tokens are the fundamental units of text that large language models read, process, and generate. Understanding tokens is essential for working effectively with any LLM API, controlling costs, and reasoning about model behavior and limitations.
A token is a chunk of text that a language model treats as a single unit; it is neither always a word nor always a character. Common words like 'cat' are typically one token, while longer or rarer words like 'tokenization' may be split into multiple tokens such as 'token' and 'ization', with the exact split depending on the tokenizer. Punctuation, whitespace, and special characters also consume tokens, although many tokenizers fold a leading space into the token for the word that follows. For English text, one token averages roughly 3–4 characters, or about 0.75 words.
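The rule of thumb above can be turned into a quick estimator. This is a minimal sketch for English prose only: `estimate_tokens` is a hypothetical helper, the two ratios are the approximate figures quoted above, and the result is a rough guess rather than a real token count.

```python
import math

# Approximate ratios taken from the rule of thumb; English prose only.
CHARS_PER_TOKEN = 4       # upper end of the "3-4 characters" range
WORDS_PER_TOKEN = 0.75    # about 0.75 English words per token


def estimate_tokens(text: str) -> dict:
    """Return two rough token estimates (hypothetical helper, not a tokenizer)."""
    by_chars = math.ceil(len(text) / CHARS_PER_TOKEN)
    by_words = math.ceil(len(text.split()) / WORDS_PER_TOKEN)
    return {"by_chars": by_chars, "by_words": by_words}


sample = "Tokens are the fundamental units of text that models read."
print(estimate_tokens(sample))
```

The two estimates usually disagree a little, which is a useful reminder that only the model's own tokenizer gives the true count.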
Before text enters a model, a tokenizer converts raw strings into sequences of integer IDs using a fixed vocabulary learned when the tokenizer was trained. The most common algorithm is Byte-Pair Encoding (BPE), which starts from individual characters or bytes and repeatedly merges the most frequent adjacent pair into a new subword unit, balancing vocabulary size against coverage. Each model family (GPT, Llama, Gemini, etc.) ships its own tokenizer and vocabulary, so the same string can produce different token counts across models. You can inspect tokenization using tools like OpenAI's web-based Tokenizer tool or the Hugging Face tokenizers library.
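To make the merge step concrete, here is a toy sketch of BPE training in pure Python. It assumes a whitespace-split corpus and character-level starting symbols; production tokenizers typically work on bytes, add special tokens, and are far more optimized. The function names are illustrative and not taken from any library.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules; return the rules and the final words."""
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:  # every word is already a single symbol
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words


merges, vocab = learn_bpe("low low low lower lowest newer newer", 6)
print(merges)
print(vocab)
```

Each learned merge adds one entry to the vocabulary, so the number of merges is the knob that trades vocabulary size against how finely rare words are split.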
LLMs have a fixed context window measured in tokens (for example, 128,000 tokens), which caps how much text the model can 'see' at once, counting both the input prompt and the generated output. API pricing is almost universally based on input and output token counts, so token awareness directly affects cost at scale. Verbose prompts, large documents, and long conversation histories consume context space quickly. Once the limit is reached, the request is either rejected or earlier content has to be truncated, and the model has no access to anything that was cut.
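The budget arithmetic is simple enough to check in code before a request goes out. The sketch below assumes the 128,000-token example figure and a drop-oldest-first policy for conversation history; both helper functions are hypothetical, and real limits and truncation policies vary by model and application.

```python
CONTEXT_WINDOW = 128_000  # example figure from the text; check your model's real limit


def fits_in_context(prompt_tokens: int, max_output_tokens: int,
                    context_window: int = CONTEXT_WINDOW) -> bool:
    """Prompt and reserved output share the same window."""
    return prompt_tokens + max_output_tokens <= context_window


def trim_history(message_token_counts: list[int], max_output_tokens: int,
                 context_window: int = CONTEXT_WINDOW) -> list[int]:
    """Return indices of the most recent messages that fit, dropping oldest first."""
    budget = context_window - max_output_tokens
    kept, used = [], 0
    # Walk backwards so the newest messages are kept preferentially.
    for index in range(len(message_token_counts) - 1, -1, -1):
        cost = message_token_counts[index]
        if used + cost > budget:
            break
        kept.append(index)
        used += cost
    return sorted(kept)


print(fits_in_context(prompt_tokens=120_000, max_output_tokens=8_000))
print(trim_history([50, 40, 30, 20], max_output_tokens=10, context_window=100))
```

Dropping the oldest messages first is only one option; summarizing them is a common alternative when earlier context still matters.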
Most APIs report token counts in two categories: prompt tokens (everything you send in) and completion tokens (everything the model generates back). Completion tokens are often priced higher because output is generated one token at a time, whereas the prompt can be processed in parallel. Setting a maximum output length (the max_tokens parameter in many APIs) caps how many tokens the model will generate, preventing unexpectedly long and costly outputs; a response that hits the cap is simply cut off.
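Because prompt and completion tokens are billed separately, the cost of a request is a two-term sum. The sketch below uses made-up prices purely for illustration; `Pricing` and `request_cost` are hypothetical names, and real rates must come from your provider's price list.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    input_per_million: float    # USD per 1M prompt tokens (illustrative value)
    output_per_million: float   # USD per 1M completion tokens (illustrative value)


def request_cost(prompt_tokens: int, completion_tokens: int, pricing: Pricing) -> float:
    """Cost in USD for one request, given separate input and output rates."""
    total = (prompt_tokens * pricing.input_per_million
             + completion_tokens * pricing.output_per_million)
    return total / 1_000_000


# Made-up rates: output priced at 4x input, only to show the asymmetry.
example = Pricing(input_per_million=2.0, output_per_million=8.0)
print(request_cost(prompt_tokens=1_500, completion_tokens=500, pricing=example))
```

Multiplying the per-request figure by expected daily traffic gives a quick estimate of what a longer prompt template will cost at scale.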
Tokenization efficiency varies significantly by language and content type. Non-Latin scripts such as Chinese, Arabic, or Hebrew are often tokenized into far more tokens per word than equivalent English text, making multilingual use cases disproportionately expensive and context-hungry. Code, JSON, URLs, and numbers with many digits also tokenize inefficiently. Always test token counts for your specific content domain rather than relying on the English-language rule of thumb.
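One contributing factor can be seen without any tokenizer: many BPE tokenizers start from UTF-8 bytes, and non-Latin scripts need more bytes per character. The sketch below measures that byte overhead only. It is a proxy under that assumption, not a token count, because learned merges can shrink or widen the gap; the sample strings are arbitrary.

```python
def bytes_per_char(text: str) -> float:
    """Average UTF-8 bytes per character; a rough proxy, not a token count."""
    return len(text.encode("utf-8")) / len(text) if text else 0.0


samples = {
    "English": "hello",
    "Hebrew": "שלום",
    "Chinese": "你好世界",
}

for language, text in samples.items():
    print(f"{language}: {bytes_per_char(text):.1f} bytes per character")
```

For real numbers, run the same samples through the tokenizer of the model you plan to use and compare tokens per sentence.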
Count tokens programmatically before sending a request, especially when building production applications with dynamic content. Use the official tokenizer library for your model to get exact counts: tiktoken for OpenAI models, or AutoTokenizer from the transformers library for models hosted on Hugging Face. Implement chunking strategies for large documents to stay within context limits, and monitor the token usage reported in API responses to track spend and catch prompt-bloat regressions early.
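A chunking strategy can be sketched as a greedy packer that takes the token counter as a parameter. Everything here is illustrative: `chunk_paragraphs` is a hypothetical helper, and the default `whitespace_count` is a stand-in that keeps the example runnable, not a real tokenizer.

```python
from typing import Callable, List


def whitespace_count(text: str) -> int:
    """Stand-in counter: NOT a real tokenizer, only keeps the sketch runnable."""
    return len(text.split())


def chunk_paragraphs(paragraphs: List[str], max_tokens: int,
                     count_tokens: Callable[[str], int] = whitespace_count) -> List[str]:
    """Greedily pack whole paragraphs into chunks of at most `max_tokens`.

    Counts are summed per paragraph, so the separator between paragraphs is
    ignored; with a real tokenizer, leave a small safety margin.
    """
    chunks, current, used = [], [], 0
    for para in paragraphs:
        cost = count_tokens(para)
        if cost > max_tokens:
            raise ValueError("a single paragraph exceeds max_tokens; split it further")
        if current and used + cost > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(para)
        used += cost
    if current:
        chunks.append("\n\n".join(current))
    return chunks


print(chunk_paragraphs(["a b c", "d e", "f g h i"], max_tokens=5))
```

To get exact counts for an OpenAI model, one option is to pass a counter built on tiktoken, for example `enc = tiktoken.get_encoding("cl100k_base")` followed by `count_tokens=lambda s: len(enc.encode(s))`, choosing whichever encoding matches your model.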
© RM Full Stack & AI Engineer