Context Window Management: Fitting the Right Information in Limited Space
Your AI coding assistant works perfectly on small files. Then a user opens a 5000-line monolith and asks “refactor the payment processing logic.” You stuff the entire file into context. The model is sluggish - 8 seconds to first token. The response addresses code from line 200 but completely ignores the payment logic at line 3800. You have technically fit the file in context (it is under the 128K limit), but the model cannot meaningfully attend to all of it. And you are paying $0.15 per request for that massive context.
The next day, a user with a microservices project asks the same kind of question. The relevant logic is spread across 12 files. You cannot fit all 12 in context, so you pick 4 that seem most relevant. The model’s response references a function from one of the 8 files you excluded. It hallucinates the function signature because it cannot see the actual code.
Context window management is the art of getting the right information into the right amount of space - not too much (degradation, cost, latency), not too little (missing context, hallucination), and structured so the model can actually use it.
What context window management actually is
Context window management is the set of strategies for deciding what content occupies the model’s limited context window for each request. It encompasses:
- What to include: Which documents, code, history, or instructions are relevant
- What to exclude: What can be safely omitted without degrading quality
- How to compress: Summarizing, truncating, or reformatting content to fit
- Where to place: Positioning content for maximum model attention
- How to budget: Allocating token “slots” across competing content needs
It is a constrained optimization problem: maximize response quality within fixed token and cost budgets.
graph TD
subgraph budget["Context Window Budget (128K tokens)"]
B1["System Prompt<br/>500-2000 tokens<br/>(fixed)"]
B2["Conversation History<br/>0-10,000 tokens<br/>(grows over time)"]
B3["Retrieved Context (RAG)<br/>2000-8000 tokens<br/>(varies per query)"]
B4["User Input<br/>50-2000 tokens<br/>(unpredictable)"]
B5["Reserved for Output<br/>500-4000 tokens<br/>(must reserve)"]
B6["Safety Buffer<br/>~10% of window<br/>(prevents truncation)"]
end
style B1 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style B2 fill:#FAEEDA,stroke:#854F0B,color:#633806
style B3 fill:#E1F5EE,stroke:#0F6E56,color:#085041
style B4 fill:#F1EFE8,stroke:#888780,color:#444441
style B5 fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
style B6 fill:#F1EFE8,stroke:#888780,color:#444441
Token budgeting
The first step is defining how your context window is allocated. Create an explicit budget:
MAX_CONTEXT = 128000
TOKEN_BUDGET = {
    "system_prompt": 1500,         # Fixed
    "few_shot_examples": 800,      # Fixed
    "conversation_history": 4000,  # Variable, capped
    "retrieved_context": 6000,     # Variable, capped
    "user_input": 2000,            # Variable
    "output_reserve": 4000,        # Must reserve for generation
    "safety_buffer": 5000,         # Never use this
}
# Total allocated: 23,300 tokens
# Remaining headroom: 104,700 tokens
Always reserve tokens for the model’s output. If you fill the context to 127,500 tokens and the model needs 2000 tokens to respond, only 500 tokens remain: depending on the provider, the request is rejected outright or the response is cut off mid-answer.
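A minimal sketch of how this reservation might be enforced before each request. The function name `max_output_tokens` and the section keys are illustrative assumptions, not part of any provider's API; the default numbers mirror the budget above.

```python
MAX_CONTEXT = 128_000

def max_output_tokens(used: dict[str, int], output_reserve: int = 4000,
                      safety_buffer: int = 5000,
                      max_context: int = MAX_CONTEXT) -> int:
    """Return the ceiling on generated tokens for this request.

    `used` maps each input section (system prompt, history, ...) to its
    measured token count. Raises if the input leaves less than the reserve.
    """
    input_tokens = sum(used.values())
    available = max_context - safety_buffer - input_tokens
    if available < output_reserve:
        raise ValueError(
            f"input uses {input_tokens} tokens; only {available} left for "
            f"output, need at least {output_reserve}"
        )
    return available

used = {"system_prompt": 1500, "history": 4000, "retrieved": 6000, "user_input": 2000}
print(max_output_tokens(used))  # 128000 - 5000 - 13500 = 109500
```

Failing loudly here is a design choice: it surfaces an oversized request at assembly time instead of as a clipped answer in production.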
Conversation history management
Multi-turn conversations grow without bound. Without management, a long conversation (30 turns with pasted code or documents is enough) blows through the history budget and, eventually, the context window itself. Strategies from simple to sophisticated:
Sliding window
Keep only the last N turns. Simple but lossy - earlier context that might be referenced is completely gone:
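def sliding_window(messages, max_turns=10):
    # Counts each message as one turn; pass 2 * N to keep N user/assistant pairs.
    # The guard matters because messages[-0:] would return the whole list.
    return messages[-max_turns:] if max_turns > 0 else []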
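A fixed turn count ignores how long each message is. One possible variant, sketched below, trims by token count and always keeps a leading system message. The name `sliding_window_by_tokens` and the injected `count_tokens` callable are assumptions for illustration; the word-count tokenizer is only a stand-in for the demo.

```python
def sliding_window_by_tokens(messages, max_tokens, count_tokens):
    """Keep the system message (if first) plus the newest messages that fit."""
    system = messages[:1] if messages and messages[0]["role"] == "system" else []
    rest = messages[len(system):]
    budget = max_tokens - sum(count_tokens(m["content"]) for m in system)
    kept = []
    for msg in reversed(rest):           # walk from newest to oldest
        cost = count_tokens(msg["content"])
        if cost > budget:
            break                        # stop at the first message that does not fit
        kept.append(msg)
        budget -= cost
    return system + kept[::-1]           # restore chronological order

# Crude stand-in tokenizer for the demo only: one token per word.
word_count = lambda text: len(text.split())

demo = [
    {"role": "system", "content": "You are helpful"},
    {"role": "user", "content": "a b c d"},
    {"role": "assistant", "content": "e f"},
    {"role": "user", "content": "g h i"},
]
trimmed = sliding_window_by_tokens(demo, max_tokens=9, count_tokens=word_count)
```

With a budget of 9, the system message and the two newest messages survive, and the oldest user message is dropped.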
Summarization
Summarize older turns into a compact representation, keeping recent turns verbatim:
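def managed_history(messages, max_tokens=4000, keep_recent=6):
    # token_count and summarize are assumed helpers (tokenizer and LLM call).
    if token_count(messages) <= max_tokens:
        return messages  # Everything fits: no compression needed
    recent = messages[-keep_recent:]  # Keep the last 6 messages verbatim
    older = messages[:-keep_recent]
    if not older:
        return recent  # Nothing older to summarize
    summary = summarize(older)  # LLM call to compress
    return [{"role": "system", "content": f"Previous conversation summary: {summary}"}] + recent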
Hierarchical memory
Multiple levels of compression:
- Immediate (last 3 turns): verbatim
- Recent (turns 4-10): key points extracted
- Long-term (turns 11+): topic-level summary
- Persistent (cross-session): stored in a database, retrieved when relevant
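The first three tiers can be sketched as simple list slicing. This is an illustration under assumptions: `extract_key_points` and `summarize` are hypothetical caller-supplied functions (most likely LLM calls), each turn is a plain string, and the persistent cross-session tier is left out because it depends on your storage layer.

```python
def build_hierarchical_history(turns, extract_key_points, summarize):
    """Assemble history from three tiers, oldest tier first.

    `turns` is ordered oldest to newest. `extract_key_points` and
    `summarize` are hypothetical helpers: list of turns -> str.
    """
    immediate = turns[-3:]       # last 3 turns: verbatim
    recent = turns[-10:-3]       # turns 4-10 counting back: key points
    long_term = turns[:-10]      # turns 11 and older: topic-level summary

    parts = []
    if long_term:
        parts.append("Summary of earlier conversation: " + summarize(long_term))
    if recent:
        parts.append("Key points from recent turns: " + extract_key_points(recent))
    parts.extend(immediate)
    return parts
```

Short conversations degrade naturally: with fewer than four turns, both compressed tiers are empty and the history is returned verbatim.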
graph LR
subgraph strategies["History Management Strategies"]
direction TB
SW["Sliding Window<br/>Simple, lossy<br/>Keep last N turns"]
SUM["Summarization<br/>Compresses older turns<br/>Preserves key info"]
HIER["Hierarchical<br/>Multi-level compression<br/>Best quality/cost"]
end
style SW fill:#F1EFE8,stroke:#888780,color:#444441
style SUM fill:#E1F5EE,stroke:#0F6E56,color:#085041
style HIER fill:#EEEDFE,stroke:#534AB7,color:#3C3489
Retrieved context management
For RAG applications, the challenge is fitting retrieved documents while maintaining quality:
Chunking strategy affects context usage
Smaller chunks (200-300 tokens) let you fit more documents but may split important context across chunks. Larger chunks (800-1500 tokens) preserve context but limit how many you can include. The right size depends on your content:
- FAQ/definitions: Small chunks (200-400 tokens) - each answer is self-contained
- Technical documentation: Medium chunks (500-800 tokens) - need enough context for procedures
- Legal/research: Larger chunks (1000-1500 tokens) - arguments span paragraphs
Retrieval budget allocation
def allocate_rag_budget(query, budget_tokens=6000):
    # Retrieve more than we'll use
    candidates = vector_search(query, top_k=20)
    # Rerank for relevance
    ranked = reranker.rank(query, candidates)
    # Fill budget greedily
    selected = []
    used_tokens = 0
    for chunk in ranked:
        chunk_tokens = count_tokens(chunk)
        if used_tokens + chunk_tokens > budget_tokens:
            break
        selected.append(chunk)
        used_tokens += chunk_tokens
    return selected
Relevance-weighted truncation
Not all retrieved chunks deserve equal space. High-relevance chunks get full inclusion; lower-relevance chunks get summarized or truncated:
def smart_context(ranked_chunks, budget):
    context_parts = []
    remaining = budget
    for i, chunk in enumerate(ranked_chunks):
        if i < 3:  # Top 3: full inclusion
            text = chunk.full_text
        elif i < 7:  # Next 4: key sentences only
            text = extract_key_sentences(chunk, max_sentences=3)
        else:
            break  # Rest: exclude
        cost = count_tokens(text)
        if cost > remaining:
            break  # Check before appending so the budget is never exceeded
        context_parts.append(text)
        remaining -= cost
    return "\n\n".join(context_parts)
Positioning for attention
The “lost in the middle” effect means content placement matters:
High attention zones:
- Beginning of context (system prompt area)
- End of context (most recent content, closest to generation)
Low attention zone:
- Middle of long contexts
Practical implications:
- Put the most critical instructions at the start AND end of the system prompt
- Place the highest-relevance retrieved content first or last in the context block
- Put conversation history (which is reference material) in the middle
- Place the user’s actual question as the final content before generation
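One simple way to apply the "first or last" placement to retrieved chunks is to alternate them between the front and the back of the context block, so the weakest chunks land in the middle. The sketch below assumes the input is already sorted best-first; `reorder_for_attention` is an illustrative name.

```python
def reorder_for_attention(ranked_chunks):
    """Place the best chunks at the edges and the weakest in the middle.

    Input must be sorted from most to least relevant.
    """
    front, back = [], []
    for i, chunk in enumerate(ranked_chunks):
        # Even ranks fill the front, odd ranks fill the back.
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

print(reorder_for_attention(["c1", "c2", "c3", "c4", "c5"]))
# ['c1', 'c3', 'c5', 'c4', 'c2']
```

The top-ranked chunk opens the block and the second-ranked chunk closes it, right before the user's question.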
Dynamic context sizing
Not every request needs the full budget. A simple “hello” should not trigger 6000 tokens of RAG retrieval:
def dynamic_budget(query):
    complexity = classify_complexity(query)  # simple, moderate, complex
    if complexity == "simple":
        return {"rag_budget": 0, "history_budget": 1000}
    elif complexity == "moderate":
        return {"rag_budget": 3000, "history_budget": 2000}
    else:  # complex
        return {"rag_budget": 8000, "history_budget": 4000}
Benefits: lower cost on simple queries, faster time to first token (TTFT), and less noise for the model to process.
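The `classify_complexity` call in `dynamic_budget` is left undefined. A cheap first approximation is a length-and-keyword heuristic like the sketch below. The greeting list, keyword list, and thresholds are all illustrative assumptions; a small classifier model would likely do better in production.

```python
import re

# Illustrative lists only; tune them against real traffic.
GREETINGS = {"hi", "hello", "hey", "thanks", "thank you", "ok", "okay", "bye"}
COMPLEX_HINTS = re.compile(r"\b(compare|refactor|across|why|trade-?offs?)\b")

def classify_complexity(query: str) -> str:
    """Return "simple", "moderate", or "complex" for a user query."""
    text = query.strip().lower().rstrip("!.?")
    words = re.findall(r"\w+", text)
    if text in GREETINGS or len(words) <= 2:
        return "simple"
    if len(words) > 30 or COMPLEX_HINTS.search(text):
        return "complex"
    return "moderate"
```

Misclassifying a complex query as simple costs quality, while the reverse only costs tokens, so it may be worth biasing the thresholds toward "moderate" when in doubt.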
Where context management breaks
Information splitting across chunks
A critical fact might be split between two chunks during retrieval. Chunk A has “The API rate limit is” and Chunk B has “500 requests per minute.” Neither chunk alone answers the question. Solutions: overlapping chunks, parent-document retrieval, or combining adjacent chunks when both score highly.
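Overlapping chunks are the simplest of these fixes. The sketch below works on a list of tokens (or words) and is a generic sliding-window splitter, not the API of any particular library; the default sizes are assumptions.

```python
def chunk_with_overlap(tokens, chunk_size=300, overlap=50):
    """Split `tokens` into windows where neighbours share `overlap` items."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break                        # the last window reached the end
    return chunks

words = "The API rate limit is 500 requests per minute".split()
for chunk in chunk_with_overlap(words, chunk_size=5, overlap=2):
    print(" ".join(chunk))
# The API rate limit is
# limit is 500 requests per
# requests per minute
```

In the toy run, the second window contains "limit is 500 requests per", so the fact that a hard split would have cut in two is retrievable from a single chunk. The cost is index size: each token is stored in more than one chunk.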
Context pollution
Retrieving too much marginally-relevant content dilutes the signal. The model attends to irrelevant passages and produces generic or confused responses. Better to include 3 highly relevant chunks than 10 somewhat-relevant ones.
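A minimal sketch of that preference for a few strong chunks: apply a score floor and a hard cap. It assumes chunks arrive as `(text, score)` pairs with higher scores meaning more relevant. The `min_score` value is a placeholder, since score scales differ between rerankers.

```python
def filter_by_relevance(scored_chunks, min_score=0.75, max_chunks=3):
    """Keep at most `max_chunks` chunks whose score clears `min_score`."""
    relevant = [(text, score) for text, score in scored_chunks if score >= min_score]
    relevant.sort(key=lambda pair: pair[1], reverse=True)   # best first
    return [text for text, _ in relevant[:max_chunks]]
```

Returning fewer than `max_chunks` (or nothing at all) is intentional: an empty context block is often better than a block of weak matches.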
Stale context in long conversations
In a 20-turn conversation, early context might contain information the user has since corrected. The model might reference outdated information from turn 3 even though turn 15 provided an update. Active state tracking (maintaining a “current facts” document) helps.
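A "current facts" document can be as small as a dictionary where later turns overwrite earlier ones. The sketch below shows only that bookkeeping; `ConversationState` is a hypothetical name, and extracting facts from each turn (usually an LLM call) is assumed to happen elsewhere.

```python
class ConversationState:
    """Tracks the latest value of each fact and the turn that set it."""

    def __init__(self):
        self._facts = {}

    def update(self, key, value, turn):
        self._facts[key] = (value, turn)     # later turns overwrite earlier ones

    def render(self):
        """Format the facts as a block to inject near the end of the context."""
        lines = [f"- {key}: {value} (as of turn {turn})"
                 for key, (value, turn) in sorted(self._facts.items())]
        return "Current facts:\n" + "\n".join(lines)

state = ConversationState()
state.update("shipping_address", "12 Oak St", turn=3)
state.update("shipping_address", "98 Pine Ave", turn=15)   # user correction
print(state.render())
```

Only the corrected address from turn 15 is rendered, so the model never sees the stale value from turn 3 as a live fact.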
Token counting inaccuracy
If your token counting uses a different tokenizer than the model, your budget calculations are wrong. Always use the target model’s actual tokenizer for counting. Off-by-10% errors accumulate across context sections and can cause unexpected truncation.
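For OpenAI models, the `tiktoken` library provides the matching tokenizer. The sketch below assumes an OpenAI target and falls back to a deliberately pessimistic character-based estimate when `tiktoken` or the model's encoding is unavailable. For other providers, substitute their own tokenizer or token-counting endpoint. `make_counter` and the default model name are illustrative.

```python
def make_counter(model="gpt-4o"):
    """Return a function mapping text to a token count for `model`.

    Assumes an OpenAI model. Falls back to a rough estimate if tiktoken
    is not installed or does not know the model.
    """
    try:
        import tiktoken
        encoding = tiktoken.encoding_for_model(model)
        return lambda text: len(encoding.encode(text))
    except Exception:
        # Assumption: about 3 characters per token is a pessimistic guess
        # for English text, so this errs toward over-counting.
        return lambda text: -(-len(text) // 3)   # ceiling division

count_tokens = make_counter()
print(count_tokens("Context window management"))
```

Erring toward over-counting in the fallback means a wrong estimate wastes some budget instead of causing truncation.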
Real-world implementations
- ChatGPT - appears to combine summarization or memory of older turns with dynamic tool/context selection based on the query (the exact internals are not publicly documented)
- Cursor - indexes the full codebase via embeddings, retrieves only relevant files/functions for each query, uses file-level summaries for broader context
- Notion AI - chunks workspace content, retrieves relevant blocks, and includes page-level metadata for navigation context
- Perplexity - retrieves web content, extracts key passages, summarizes lengthy pages before inclusion, and positions sources by relevance
- Amazon Q - manages enterprise context across multiple data sources with dynamic retrieval and permission-aware filtering
How to apply in practice
Always count tokens before sending. Never assume content fits. Build token counting into your pipeline and fail gracefully when budgets are exceeded.
Implement graceful degradation. When context exceeds budget: first reduce history, then reduce RAG chunks, then summarize remaining content. Never truncate the user’s question or the system prompt.
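A minimal sketch of the first two degradation steps, assuming history is ordered oldest-first and chunks are ordered most-relevant-first. `fit_to_budget` and the injected `count_tokens` callable are illustrative names. The third step (summarizing what remains) is omitted because it requires an LLM call.

```python
def fit_to_budget(system, question, history, chunks, budget, count_tokens):
    """Drop optional context until everything fits in `budget` tokens.

    The system prompt and the question are never touched.
    """
    history, chunks = list(history), list(chunks)    # do not mutate inputs

    def total():
        return sum(count_tokens(t) for t in [system, question, *history, *chunks])

    while total() > budget and history:
        history.pop(0)               # 1. drop the oldest history first
    while total() > budget and chunks:
        chunks.pop()                 # 2. then drop the lowest-ranked chunks
    if total() > budget:
        raise ValueError("system prompt and question alone exceed the budget")
    return history, chunks
```

Raising when even the required parts do not fit keeps the failure explicit, so the caller can ask the user to shorten the input instead of silently clipping it.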
Monitor context utilization. Track average context window usage per request. If you are consistently using only 10% of available context, you might be under-utilizing retrieval. If you are consistently hitting limits, you need better compression or a model with a larger window.
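A rough sketch of such a report, computed from logged input token counts. The 10% floor comes from the paragraph above; the 90% ceiling and the function name `utilization_report` are assumptions.

```python
from statistics import mean

def utilization_report(input_token_counts, max_context=128_000, low=0.10, high=0.90):
    """Summarize how much of the window each logged request used."""
    if not input_token_counts:
        raise ValueError("need at least one logged request")
    ratios = [n / max_context for n in input_token_counts]
    return {
        "mean_utilization": mean(ratios),
        "share_below_low": sum(r < low for r in ratios) / len(ratios),
        "share_above_high": sum(r > high for r in ratios) / len(ratios),
    }

report = utilization_report([6_400, 12_800, 121_600, 64_000])
```

Looking at the two tail shares alongside the mean matters because an average of 40% can hide a population split between near-empty and near-full requests.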
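Profile context usage. Hosted APIs rarely expose attention weights, so the most practical check is ablation: remove a context section, rerun the request, and compare the output (or the logprobs of the original answer, where available). With open-weight models you can also inspect attention directly. If certain context sections never influence outputs, stop including them.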
Version your context strategy. Changes to chunk sizes, retrieval counts, history management, or budget allocation affect output quality. A/B test changes against your eval suite before deploying.
FAQ
Q: Should I always use the largest context window available?
No. You pay for every input token you send, so filling a larger window raises both cost and latency per request, and long contexts suffer from attention degradation. Use the smallest amount of context that produces acceptable quality for your task. A 32K window with well-curated content often outperforms a 128K window stuffed with marginally relevant material. Reach for a larger window only when you genuinely need it.
Q: How do I handle the case where the user asks about something from 50 turns ago?
Hybrid approach: keep a compressed summary of the full conversation history, plus a searchable index of key facts/decisions from earlier turns. When the user references something old (“what was that pricing we discussed earlier?”), search the index to retrieve the specific turn and inject it into current context. This is essentially RAG over conversation history.
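The lookup step can be sketched with plain word overlap standing in for the search index. This is a toy: a real system would more likely use embeddings and filter out stopwords. `find_relevant_turns` is an illustrative name.

```python
import re

def _words(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def find_relevant_turns(query, turns, top_k=2):
    """Return up to `top_k` (turn_number, text) pairs ranked by word overlap."""
    query_words = _words(query)
    scored = []
    for number, text in enumerate(turns, start=1):
        overlap = len(query_words & _words(text))
        if overlap:
            scored.append((overlap, number, text))
    scored.sort(key=lambda item: (-item[0], item[1]))   # best overlap, then oldest
    return [(number, text) for _, number, text in scored[:top_k]]

history = [
    "Hi, I need help choosing a plan.",
    "The Pro plan pricing is $40 per seat per month.",
    "Great, can you also explain SSO setup?",
]
hits = find_relevant_turns("what was that pricing we discussed earlier?", history)
```

Here only turn 2 shares a word ("pricing") with the query, so it is the only turn injected back into the current context.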
Q: Is it better to send many small requests or one large request with everything?
Depends on the task. For independent questions: many small requests are cheaper (no wasted context) and more parallelizable. For tasks requiring cross-reference (comparing documents, multi-step reasoning over a dataset): one large request with all relevant context produces more coherent outputs. The deciding factor is whether answering requires the model to see several pieces of information at once: if it does, send them together; if it does not, split the work.
Interview questions
Q: Design the context management strategy for an AI customer support bot that handles 1000 concurrent conversations, each potentially lasting 50+ turns.
Hierarchical approach: (1) Fixed budget: 2000 tokens for system prompt (cached), 1000 for product knowledge base (most relevant FAQ entries), 3000 for conversation context, 500 for current query. (2) Conversation management: last 5 turns verbatim, turns 6-15 as extracted key points (decisions made, issues identified), turns 16+ as a topic-level summary refreshed every 10 turns. (3) Dynamic retrieval: based on current query, pull relevant KB articles into the context. (4) Persistence: store full transcripts externally for compliance, but only load relevant portions into context. (5) Cost optimization: detect simple queries (greetings, confirmations) and skip RAG retrieval entirely. At 1000 concurrent conversations, budget-aware context management saves $500+/day compared to naive full-history inclusion.
Q: Your RAG application retrieves 10 chunks per query but users report the model “ignores” some retrieved information. Diagnose and propose solutions.
Likely causes: (1) Lost-in-the-middle effect - chunks in positions 4-7 get lower attention. Fix: reorder by relevance with highest-relevance chunks first and last. (2) Too many chunks dilute attention - the model has too much to process. Fix: reduce to top 5 chunks, or summarize lower-ranked chunks into a single paragraph. (3) Chunks are too similar - 3 chunks say roughly the same thing, wasting budget. Fix: deduplicate retrieved chunks using similarity thresholds before inclusion. (4) The user’s question is not explicit enough to focus attention. Fix: reformulate the query as a specific instruction: “Using the context below, answer specifically about X.” Test by removing chunks one at a time and checking if the answer changes - if removing a chunk has no effect, it was being ignored anyway.
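The deduplication fix in (3) can be sketched with Jaccard similarity over word sets. Production systems more often compare embeddings, and the 0.8 threshold is an assumption to tune. The input is expected best-first, so the higher-ranked copy of each duplicate is the one kept.

```python
def dedupe_chunks(ranked_chunks, threshold=0.8):
    """Drop chunks whose word-set Jaccard similarity to a kept chunk
    is at or above `threshold`."""
    kept, kept_words = [], []
    for chunk in ranked_chunks:
        words = set(chunk.lower().split())
        is_duplicate = any(
            len(words & seen) / len(words | seen) >= threshold
            for seen in kept_words
            if words | seen              # skip the empty-vs-empty case
        )
        if not is_duplicate:
            kept.append(chunk)
            kept_words.append(words)
    return kept

chunks = [
    "the rate limit is 500 requests per minute",
    "The rate limit is 500 requests per minute",
    "billing runs on the first of the month",
]
unique = dedupe_chunks(chunks)
```

The second chunk differs from the first only in capitalization, so it is dropped, and the freed budget can go to a chunk that adds new information.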
Q: You need to reduce LLM costs by 50% without significantly degrading quality. What context management optimizations would you implement?
Ordered by impact: (1) Prompt caching - if the system prompt is long (on the order of 1000+ tokens) and identical across requests, cached input tokens are billed at a discount that varies by provider, so this alone can cut the cost of that portion by roughly half or more. (2) Dynamic context sizing - route simple queries (40% of traffic) through minimal context paths, saving 60% on those requests. (3) Reduce chunk count - drop from 8 retrieved chunks to 4, use a reranker to ensure top 4 are highest quality. (4) Summarize conversation history aggressively - compress turns older than 5 into a 200-token summary instead of keeping 2000 tokens of raw history. (5) Output token limits - set appropriate max_tokens per task type (classification: 10 tokens, summary: 500, full response: 2000). Measure quality impact at each step using your eval suite. Stop optimizing when quality drops below your threshold.