Token Budgeting & Cost Control: Managing LLM Spend at Scale

Your AI feature launches. Users love it. Usage grows 10x in a month. Your LLM API bill goes from $800 to $24,000. The feature that was supposed to improve margins is now your largest infrastructure cost after compute. Nobody modeled the per-request cost before launch because “it’s just API calls.” But at $15 per million output tokens, 50,000 daily users generating 500-token responses is $11,250/month - before input tokens, retrieval, and embeddings.

Token budgeting is not an optimization you do later. It is a design constraint you apply from the start, like memory budgets in embedded systems or bandwidth budgets in mobile apps.

The cost model

graph TD
  subgraph costs["Per-Request Cost Components"]
      IT["Input Tokens
(system + context + query)
$3-15 / 1M tokens"]
      OT["Output Tokens
(model response)
$15-60 / 1M tokens"]
      EMB["Embedding
(query embedding)
$0.02-0.13 / 1M tokens"]
      VDB["Vector DB Query
$0.01-0.10 per query"]
  end
  subgraph total["Example: Single Request"]
      EX["Input: 3000 tokens × $3/M = $0.009
Output: 500 tokens × $15/M = $0.0075
Embedding: 100 tokens × $0.13/M = $0.00001
Vector search: $0.01
Total: ~$0.027 per request"]
  end

  style IT fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
  style OT fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
  style EMB fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style EX fill:#FAEEDA,stroke:#854F0B,color:#633806

Cost estimation formula

def estimate_monthly_cost(
    daily_requests: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
):
    monthly_requests = daily_requests * 30
    input_cost = (monthly_requests * avg_input_tokens / 1_000_000) * input_price_per_m
    output_cost = (monthly_requests * avg_output_tokens / 1_000_000) * output_price_per_m
    return input_cost + output_cost

# Example: 50K requests/day, 3000 input, 500 output, GPT-4o pricing
cost = estimate_monthly_cost(50000, 3000, 500, 2.50, 10.00)
# Input: $11,250 + Output: $7,500 = $18,750/month

Cost control strategies

1. Model routing (40-70% savings)

Route simple queries to cheap models, complex to expensive. Most impactful single optimization.

2. Prompt caching (50-90% savings on cached portion)

Static system prompts and examples get cached at 10% of regular price.

3. Output length control

Set max_tokens appropriately. Classification: 10 tokens. Summary: 200. Never unlimited.

4. Context window optimization

Retrieve 3 focused chunks instead of 10 broad ones. Every unnecessary token in context costs money.

5. Response caching

Cache final responses for repeated identical queries. FAQ-type questions might have 80% cache hit rate.

6. Batch processing

Group non-urgent requests into batches (OpenAI Batch API: 50% discount, 24-hour completion).

7. Per-user/per-feature budgets

class TokenBudget:
    def __init__(self, daily_limit: int):
        self.daily_limit = daily_limit
        self.used_today = 0
    
    def can_spend(self, estimated_tokens: int):
        return self.used_today + estimated_tokens <= self.daily_limit
    
    def spend(self, actual_tokens: int):
        self.used_today += actual_tokens

# Free tier: 10K tokens/day. Pro: 100K. Enterprise: unlimited.

Monitoring and alerting

Track daily/weekly/monthly spend. Alert on:

Single request costing >$1 (runaway agent loop?)
Daily spend exceeding 150% of average (traffic spike or bug?)
Per-user spend exceeding tier limits
New feature launches without cost estimates

Real-world cost management

OpenAI - tiered pricing (GPT-4o at $2.50/$10 input/output, GPT-4o-mini at $0.15/$0.60)
Anthropic - similar tiers (Opus at $15/$75, Sonnet at $3/$15, Haiku at $0.25/$1.25)
Batch APIs - 50% discount for non-real-time processing (both OpenAI and Anthropic)
Committed use discounts - enterprise agreements with volume discounts

How to apply in practice

Estimate costs BEFORE building. Model the expected request volume × per-request cost. If the unit economics do not work, redesign before building.

Set budget alerts at 80% and 100% of monthly targets. Investigate any spike immediately - it might be a bug (infinite loop) or abuse.

Track cost per user action. Not just total spend, but “generating a report costs $0.45” and “answering a chat question costs $0.03.” This informs pricing and tier decisions.

Review costs weekly during growth. Costs that are fine at 1000 users can be unsustainable at 100,000. Optimize proactively as you scale.

FAQ

Q: What is the typical cost per AI interaction for production applications?

Ranges widely: $0.001-0.005 for simple classification/extraction (small model, minimal context), $0.01-0.05 for standard RAG Q&A, $0.10-0.50 for complex agentic tasks (multiple model calls, tool use). If your per-interaction cost exceeds the value that interaction provides, the economics do not work.

Q: Input tokens are cheaper - should I use longer prompts to get shorter outputs?

Yes, strategically. Detailed instructions and examples (input tokens) reduce the model’s output verbosity and improve quality. Spending 500 extra input tokens ($0.0015) to reduce output by 200 tokens (saving $0.003) is a net win. But do not pad prompts unnecessarily.

Interview questions

Q: Your AI feature processes 200K requests/day. You need to reduce LLM costs by 60% without degrading quality. Propose a plan with estimated savings for each optimization.

Starting cost: estimate based on current model/tokens. Optimization plan: (1) Model routing - 50% of requests are simple → route to mini model. Savings: ~35% of total. (2) Prompt caching - 2000-token static prefix on every request. Savings: ~15% of total (saves on 50% of input token cost for cached portion). (3) Context reduction - reduce retrieved chunks from 8 to 4 (improve reranking to maintain quality). Savings: ~10% of total. (4) Output length limits - enforce max_tokens per task type, stop unnecessary verbosity. Savings: ~5%. (5) Response caching for top-50 most common queries (covers ~10% of traffic). Savings: ~5%. Combined: 35+15+10+5+5 = 70% theoretical savings. Realistically 55-65% after accounting for overlap and imperfect routing. Validate each step against eval suite before deploying.