Tokens & Context Windows: The Real Constraints of LLM Applications
Your RAG pipeline retrieves 20 relevant documents and stuffs them all into the prompt along with the user’s question and system instructions. The response comes back, but it completely ignores documents 15 through 20. Your retrieval was perfect. Your ranking was perfect. But the model simply could not attend to that much content meaningfully. You just hit the context window wall - not the hard limit where the API rejects your request, but the soft limit where performance degrades because attention becomes diluted across too many tokens.
This is one of the most common mistakes in LLM application development: treating the context window like a bucket you can fill to the brim. It is not. It is more like a spotlight with a fixed amount of brightness - the more you illuminate, the dimmer each part gets.
What tokens actually are
A token is the atomic unit of text that an LLM processes. It is not a word, not a character, and not a syllable. It is a subword unit determined by the model’s tokenizer.
Most modern LLMs use Byte-Pair Encoding (BPE) or a variant. BPE starts with individual characters and iteratively merges the most frequent pairs into new tokens. The result is a vocabulary of 32K-200K tokens that can represent any text efficiently.
Some tokenization patterns to internalize:
- Common English words are single tokens: “the”, “and”, “is” = 1 token each
- Longer or less common words get split: “tokenization” = “token” + “ization” (2 tokens)
- Code is often expensive: variable names, indentation, and symbols add up fast
- Non-English languages typically use 2-4x more tokens for equivalent meaning
- Numbers are tokenized digit by digit or in small groups: “123456” might be 3+ tokens
- Whitespace and punctuation are tokens too
graph LR
subgraph input["Input Text"]
A["'Hello, how are you doing today?'"]
end
subgraph tokens["Tokenized (7 tokens)"]
T1["Hello"]
T2[","]
T3[" how"]
T4[" are"]
T5[" you"]
T6[" doing"]
T7[" today?"]
end
A --> T1
A --> T2
A --> T3
A --> T4
A --> T5
A --> T6
A --> T7
style A fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style T1 fill:#E1F5EE,stroke:#0F6E56,color:#085041
style T2 fill:#E1F5EE,stroke:#0F6E56,color:#085041
style T3 fill:#E1F5EE,stroke:#0F6E56,color:#085041
style T4 fill:#E1F5EE,stroke:#0F6E56,color:#085041
style T5 fill:#E1F5EE,stroke:#0F6E56,color:#085041
style T6 fill:#E1F5EE,stroke:#0F6E56,color:#085041
style T7 fill:#E1F5EE,stroke:#0F6E56,color:#085041
Practical token math
A rough heuristic for English text: 1 token is roughly 4 characters or 0.75 words. So:
- 1,000 words is approximately 1,300 tokens
- A typical page of text is roughly 400-500 tokens
- A 10-page document is roughly 4,000-5,000 tokens
- A full codebase file (500 lines) might be 2,000-4,000 tokens depending on language
This matters because you are paying per token (input and output) and constrained by the context window.
Context windows explained
The context window is the maximum number of tokens the model can process in a single forward pass. It includes everything: system prompt, user message, retrieved documents, conversation history, and the model’s response.
Current context window sizes (as of 2026):
- GPT-4o: 128K tokens
- Claude 3.5/4: 200K tokens
- Gemini 1.5 Pro: 2M tokens
- LLaMA 3: 8K-128K tokens depending on variant
- Mistral Large: 128K tokens
graph TB
subgraph cw["Context Window (128K tokens)"]
direction TB
SP["System Prompt
~500 tokens"]
CH["Conversation History
~2,000 tokens"]
RAG["Retrieved Documents
~8,000 tokens"]
UQ["User Query
~100 tokens"]
RES["Model Response
~2,000 tokens"]
end
subgraph budget["Token Budget"]
USED["Used: ~12,600"]
FREE["Available: ~115,400"]
end
style SP fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style CH fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style RAG fill:#E1F5EE,stroke:#0F6E56,color:#085041
style UQ fill:#FAEEDA,stroke:#854F0B,color:#633806
style RES fill:#FAEEDA,stroke:#854F0B,color:#633806
style USED fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
style FREE fill:#E1F5EE,stroke:#0F6E56,color:#085041
The “lost in the middle” problem
Research from Stanford and others has shown that LLMs struggle with information placed in the middle of long contexts. They attend well to the beginning and end of the context, but retrieval accuracy drops significantly for content in the middle. This is not a tokenization issue - it is an attention pattern issue. The model’s attention mechanism tends to form a U-shaped curve of effectiveness.
Practical implication: put your most important context at the beginning or end of the prompt, not buried in the middle.
How context windows actually work under the hood
During generation, the model maintains a KV cache - a stored representation of all previous tokens’ Key and Value vectors. For each new token generated, the model computes attention against this entire cache. This is why:
- Prefill is expensive: Processing the initial prompt requires computing attention for all input tokens against each other. For a 100K token input, this is enormous.
- Generation is cheaper per token: Each new token only needs to attend to the existing cache, not reprocess everything.
- Memory scales linearly with context: The KV cache grows with each token. A 128K context window with 96 attention heads needs gigabytes of GPU memory just for the cache.
Where it breaks: the practical limits
Limit 1: Hard rejection. Exceed the context window and the API returns an error. This is the easy case - you know it failed.
Limit 2: Silent degradation. Fill 90% of a 128K window and the model’s ability to follow instructions, maintain coherence, and recall specific details drops measurably. The API does not tell you this is happening.
Limit 3: Cost explosion. Context window pricing is per-token for both input and output. Stuffing 100K tokens into every request at $3/M input tokens costs $0.30 per request. At 1000 requests/hour, that is $7,200/day just for input tokens.
Limit 4: Latency. Time-to-first-token scales with input length. A 100K token prompt takes noticeably longer to start generating than a 1K token prompt.
Real-world systems and their strategies
- ChatGPT - uses a sliding window over conversation history, summarizing older messages to stay within limits
- Cursor/Copilot - selectively includes relevant code files rather than the entire codebase, using embeddings to pick what to include
- Perplexity - retrieves and summarizes web content into compact chunks rather than injecting full pages
- Notion AI - chunks documents and only retrieves the relevant sections for the user’s query
- Google NotebookLM - pre-processes uploaded documents into a structured index, retrieving specific passages on demand
How to manage context windows in practice
Strategy 1: Token counting before sending. Always count tokens before making an API call. Use tiktoken (OpenAI) or the model’s tokenizer to get exact counts. Build in a buffer - aim for 80% utilization max.
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
tokens = enc.encode(full_prompt)
if len(tokens) > MAX_CONTEXT * 0.8:
# Truncate or summarize
pass
Strategy 2: Hierarchical summarization. For conversation history, summarize older turns rather than including verbatim. “The user asked about caching patterns and I explained write-through vs write-back” uses fewer tokens than the full exchange.
Strategy 3: Smart retrieval over stuffing. Do not retrieve 20 documents and hope the model finds the answer. Retrieve 3-5 highly relevant chunks, rank them, and place the most relevant at the top of the context.
Strategy 4: Prompt compression. Tools like LLMLingua can compress prompts by 2-5x while preserving semantic content. Useful when you must include large contexts but want to reduce cost and latency.
Strategy 5: Structured context. Use headers, bullet points, and clear delimiters to help the model parse your context efficiently. Unstructured walls of text are harder for the model to attend to selectively.
FAQ
Q: If a model has a 200K context window, can I reliably use all 200K tokens?
No. The advertised context window is the hard maximum, not the reliable working range. Most models show measurable degradation in recall and instruction-following above 50-60% utilization. Treat the context window like disk space - you can fill it, but performance suffers long before you hit the limit.
Q: Do input tokens and output tokens cost the same?
Almost never. Output tokens are typically 2-5x more expensive than input tokens because generation requires sequential computation (one token at a time), while input processing can be parallelized. This is why “verbose” system prompts that reduce the model’s output length can actually save money.
Q: Why do some models tokenize the same text differently?
Each model has its own tokenizer trained on its specific training data. GPT-4’s tokenizer (cl100k_base) has ~100K vocabulary items. Claude’s tokenizer is different. This means the same text produces different token counts across models - always use the target model’s tokenizer for accurate counting.
Interview questions
Q: You are building a customer support chatbot that needs to reference a 50-page product manual. How do you handle the context window constraint?
Strong answers describe a RAG approach: chunk the manual into semantic sections, embed them, retrieve only relevant chunks for each query (3-5 chunks typically), and place them in context with the user’s question. Mention token budgeting - reserving tokens for system prompt, conversation history, and response. Great answers discuss fallback strategies when retrieval confidence is low.
Q: A user reports that your AI assistant “forgets” earlier parts of a long conversation. What is happening technically, and how would you fix it?
The conversation history exceeds what the model can meaningfully attend to. Solutions: implement a sliding window with summarization of older turns, use a separate memory store that the model can query, or implement a hierarchical memory where recent turns are verbatim and older turns are compressed summaries. The key insight is that this is an attention/capacity issue, not a bug.
Q: Your RAG system retrieves relevant documents but the model still gives incorrect answers. The documents are correct. What might be wrong?
Classic “lost in the middle” problem - relevant info is buried in the middle of a large context. Other causes: too many retrieved documents diluting attention, relevant passage split across chunk boundaries, or the model’s instruction to use the documents is too weak. Fix by: reducing chunk count, reranking with relevant chunks first/last, using explicit quotes in the prompt, or adding “answer ONLY from the provided documents” instruction.