A context window is the maximum amount of text (measured in tokens) that a large language model (LLM) can process in a single interaction — including both the input you provide and the output it generates.
Every LLM processes text as a sequence of tokens, where a token is roughly 3–4 characters or about 0.75 words in English. The context window sets the hard upper limit on how many tokens the model can 'see' at once. Anything outside that window is invisible to the model: it has no memory of that text unless you explicitly include it again.
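The rule of thumb above can be turned into a quick estimator. This is a minimal sketch, assuming 4 characters per token and 0.75 words per token for English text; `estimate_tokens` is a hypothetical helper, not part of any library, and a real tokenizer will give different counts.

```python
import math


def estimate_tokens(text: str) -> int:
    """Rough token estimate for English text. Not a real tokenizer."""
    by_chars = len(text) / 4              # assumption: ~4 characters per token
    by_words = len(text.split()) / 0.75   # assumption: ~0.75 words per token
    # Average the two heuristics and round up to stay slightly conservative.
    return math.ceil((by_chars + by_words) / 2)


print(estimate_tokens("hello world"))  # 3
```

An estimate like this is only useful for early planning; for billing or hard limits, count tokens with the tokenizer that matches the model you are calling.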
The context window is consumed by everything in a single request: the system prompt, conversation history, user messages, retrieved documents, and the model's own generated reply. For example, a 128,000-token window might hold roughly 96,000 words — about the length of a full novel. Developers must carefully budget token usage across all these components.
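One way to picture this budgeting is as simple arithmetic over the components listed above. The sketch below is illustrative only: the `TokenBudget` class and all of the token counts are made-up assumptions, not figures from any particular model or API.

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    context_window: int       # total tokens the model accepts (input + output)
    system_prompt: int
    history: int
    user_message: int
    retrieved_docs: int
    reserved_for_reply: int   # tokens set aside for the model's generated output

    def used(self) -> int:
        """Tokens already committed across every component of the request."""
        return (self.system_prompt + self.history + self.user_message
                + self.retrieved_docs + self.reserved_for_reply)

    def remaining(self) -> int:
        """Headroom left in the window; negative means the request will not fit."""
        return self.context_window - self.used()


# Hypothetical numbers for a 128,000-token window.
budget = TokenBudget(
    context_window=128_000,
    system_prompt=1_500,
    history=40_000,
    user_message=500,
    retrieved_docs=60_000,
    reserved_for_reply=4_000,
)
print(budget.used())       # 106000
print(budget.remaining())  # 22000
```

Reserving space for the reply up front matters because the output counts against the same window as the input.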
The context window directly determines how much information a model can reason over in one shot — longer windows enable summarizing large documents, maintaining long conversations, or processing entire codebases. Smaller windows force developers to use chunking, retrieval-augmented generation (RAG), or conversation truncation strategies. Choosing the right model for a task often comes down to its context window size versus its cost per token.
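As an illustration of the conversation truncation strategy mentioned above, the following sketch keeps only the most recent messages that fit a token budget. It assumes a crude 4-characters-per-token counter (`rough_token_count`, a hypothetical placeholder); in practice you would pass in a counter backed by the model's actual tokenizer.

```python
from typing import Callable, List


def rough_token_count(text: str) -> int:
    # Placeholder heuristic (~4 characters per token), not a real tokenizer.
    return max(1, len(text) // 4)


def truncate_history(
    messages: List[str],
    max_tokens: int,
    count: Callable[[str], int] = rough_token_count,
) -> List[str]:
    """Keep the newest messages whose combined token count fits max_tokens."""
    kept: List[str] = []
    total = 0
    for message in reversed(messages):   # walk from newest to oldest
        cost = count(message)
        if total + cost > max_tokens:
            break                        # older messages are dropped
        kept.append(message)
        total += cost
    kept.reverse()                       # restore chronological order
    return kept


history = ["a" * 40, "b" * 40, "c" * 40]    # about 10 tokens each
print(len(truncate_history(history, 25)))   # 2
```

Dropping the oldest turns is the simplest policy; summarizing old turns or retrieving only relevant ones (as in RAG) are alternatives that trade extra work for less lost context.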
Transformer-based LLMs use self-attention to relate every token in the context to every other token, which lets the model draw on information from anywhere in the input. With standard (full) attention, this computation scales quadratically with sequence length (O(n²)), making very large context windows computationally expensive. Techniques like sliding-window attention and sparse attention reduce this cost at extreme lengths by limiting which tokens each token attends to.
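To make the quadratic growth concrete, the sketch below counts the token-to-token scores in one full attention matrix and compares that with an upper bound for sliding-window attention. The function names and the 4,096-token window are illustrative assumptions, and the counts ignore heads, layers, and implementation optimizations.

```python
def full_attention_scores(n_tokens: int) -> int:
    """Scores in one full self-attention matrix: every token against every token."""
    return n_tokens * n_tokens


def sliding_window_scores(n_tokens: int, window: int) -> int:
    """Upper bound when each token attends to at most `window` tokens."""
    return n_tokens * min(n_tokens, window)


for n in (1_000, 8_000, 128_000):
    print(n, full_attention_scores(n), sliding_window_scores(n, 4_096))

# Doubling the sequence length quadruples the full-attention count.
print(full_attention_scores(2_000) // full_attention_scores(1_000))  # 4
```

At 128,000 tokens the full matrix has 128,000 × 128,000 entries, while the windowed bound grows only linearly with sequence length.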
A common misconception is that placing information anywhere inside a large context window guarantees the model will use it accurately. Research shows LLMs often suffer from 'lost in the middle' degradation: they tend to use content at the very start and very end of the context most reliably, and can miss critical details buried in the middle. Where possible, place the most important information near the beginning or end of your prompt.
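One possible way to act on this advice when assembling retrieved passages is to reorder them so the strongest items sit at the edges of the prompt. The sketch below assumes the input list is already sorted from most to least relevant; `order_for_edges` is a hypothetical helper, and whether this reordering helps in practice depends on the model and task.

```python
from typing import List


def order_for_edges(docs_by_relevance: List[str]) -> List[str]:
    """Alternate items between the front and back of the prompt.

    Input must be sorted most relevant first. The most relevant item ends up
    first, the second most relevant last, and the least relevant in the middle.
    """
    front: List[str] = []
    back: List[str] = []
    for i, doc in enumerate(docs_by_relevance):
        if i % 2 == 0:
            front.append(doc)
        else:
            back.append(doc)
    return front + back[::-1]


print(order_for_edges(["A", "B", "C", "D", "E"]))  # ['A', 'C', 'E', 'D', 'B']
```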
© RM Full Stack & AI Engineer