Token Budgeting & Cost Control: Managing LLM Spend at Scale
Your AI feature launches. Users love it. Usage grows 10x in a month. Your LLM API bill goes from $800 to $24,000. The feature that was supposed to improve margins is now your largest infrastructure cost after compute. Nobody modeled the per-request cost before launch because “it’s just API calls.” But at $15 per million output tokens, 50,000 daily users generating 500-token responses is $11,250/month - before input tokens, retrieval, and embeddings.
Token budgeting is not an optimization you do later. It is a design constraint you apply from the start, like memory budgets in embedded systems or bandwidth budgets in mobile apps.
The cost model
graph TD
subgraph costs["Per-Request Cost Components"]
IT["Input Tokens
(system + context + query)
$3-15 / 1M tokens"]
OT["Output Tokens
(model response)
$15-60 / 1M tokens"]
EMB["Embedding
(query embedding)
$0.02-0.13 / 1M tokens"]
VDB["Vector DB Query
$0.01-0.10 per query"]
end
subgraph total["Example: Single Request"]
EX["Input: 3000 tokens × $3/M = $0.009
Output: 500 tokens × $15/M = $0.0075
Embedding: 100 tokens × $0.13/M = $0.00001
Vector search: $0.01
Total: ~$0.027 per request"]
end
style IT fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
style OT fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
style EMB fill:#E1F5EE,stroke:#0F6E56,color:#085041
style EX fill:#FAEEDA,stroke:#854F0B,color:#633806
Cost estimation formula
def estimate_monthly_cost(
daily_requests: int,
avg_input_tokens: int,
avg_output_tokens: int,
input_price_per_m: float,
output_price_per_m: float,
):
monthly_requests = daily_requests * 30
input_cost = (monthly_requests * avg_input_tokens / 1_000_000) * input_price_per_m
output_cost = (monthly_requests * avg_output_tokens / 1_000_000) * output_price_per_m
return input_cost + output_cost
# Example: 50K requests/day, 3000 input, 500 output, GPT-4o pricing
cost = estimate_monthly_cost(50000, 3000, 500, 2.50, 10.00)
# Input: $11,250 + Output: $7,500 = $18,750/month
Cost control strategies
1. Model routing (40-70% savings)
Route simple queries to cheap models, complex to expensive. Most impactful single optimization.
2. Prompt caching (50-90% savings on cached portion)
Static system prompts and examples get cached at 10% of regular price.
3. Output length control
Set max_tokens appropriately. Classification: 10 tokens. Summary: 200. Never unlimited.
4. Context window optimization
Retrieve 3 focused chunks instead of 10 broad ones. Every unnecessary token in context costs money.
5. Response caching
Cache final responses for repeated identical queries. FAQ-type questions might have 80% cache hit rate.
6. Batch processing
Group non-urgent requests into batches (OpenAI Batch API: 50% discount, 24-hour completion).
7. Per-user/per-feature budgets
class TokenBudget:
def __init__(self, daily_limit: int):
self.daily_limit = daily_limit
self.used_today = 0
def can_spend(self, estimated_tokens: int):
return self.used_today + estimated_tokens <= self.daily_limit
def spend(self, actual_tokens: int):
self.used_today += actual_tokens
# Free tier: 10K tokens/day. Pro: 100K. Enterprise: unlimited.
Monitoring and alerting
Track daily/weekly/monthly spend. Alert on:
- Single request costing >$1 (runaway agent loop?)
- Daily spend exceeding 150% of average (traffic spike or bug?)
- Per-user spend exceeding tier limits
- New feature launches without cost estimates
Real-world cost management
- OpenAI - tiered pricing (GPT-4o at $2.50/$10 input/output, GPT-4o-mini at $0.15/$0.60)
- Anthropic - similar tiers (Opus at $15/$75, Sonnet at $3/$15, Haiku at $0.25/$1.25)
- Batch APIs - 50% discount for non-real-time processing (both OpenAI and Anthropic)
- Committed use discounts - enterprise agreements with volume discounts
How to apply in practice
Estimate costs BEFORE building. Model the expected request volume × per-request cost. If the unit economics do not work, redesign before building.
Set budget alerts at 80% and 100% of monthly targets. Investigate any spike immediately - it might be a bug (infinite loop) or abuse.
Track cost per user action. Not just total spend, but “generating a report costs $0.45” and “answering a chat question costs $0.03.” This informs pricing and tier decisions.
Review costs weekly during growth. Costs that are fine at 1000 users can be unsustainable at 100,000. Optimize proactively as you scale.
FAQ
Q: What is the typical cost per AI interaction for production applications?
Ranges widely: $0.001-0.005 for simple classification/extraction (small model, minimal context), $0.01-0.05 for standard RAG Q&A, $0.10-0.50 for complex agentic tasks (multiple model calls, tool use). If your per-interaction cost exceeds the value that interaction provides, the economics do not work.
Q: Input tokens are cheaper - should I use longer prompts to get shorter outputs?
Yes, strategically. Detailed instructions and examples (input tokens) reduce the model’s output verbosity and improve quality. Spending 500 extra input tokens ($0.0015) to reduce output by 200 tokens (saving $0.003) is a net win. But do not pad prompts unnecessarily.
Interview questions
Q: Your AI feature processes 200K requests/day. You need to reduce LLM costs by 60% without degrading quality. Propose a plan with estimated savings for each optimization.
Starting cost: estimate based on current model/tokens. Optimization plan: (1) Model routing - 50% of requests are simple → route to mini model. Savings: ~35% of total. (2) Prompt caching - 2000-token static prefix on every request. Savings: ~15% of total (saves on 50% of input token cost for cached portion). (3) Context reduction - reduce retrieved chunks from 8 to 4 (improve reranking to maintain quality). Savings: ~10% of total. (4) Output length limits - enforce max_tokens per task type, stop unnecessary verbosity. Savings: ~5%. (5) Response caching for top-50 most common queries (covers ~10% of traffic). Savings: ~5%. Combined: 35+15+10+5+5 = 70% theoretical savings. Realistically 55-65% after accounting for overlap and imperfect routing. Validate each step against eval suite before deploying.