Model Routing & Fallbacks: Using the Right Model for Each Task
Your AI application uses GPT-4 for every request. Monthly cost: $15,000. You analyze the traffic: 60% are simple questions (greetings, FAQ lookups, simple classification) that a $0.15/M-token model handles perfectly. 30% are moderate tasks (summarization, code review) where Claude Haiku works fine. Only 10% genuinely need GPT-4’s reasoning power (complex analysis, novel problem-solving). With routing, you send each request to the cheapest model that can handle it. Monthly cost drops to $3,200. Same quality where it matters. Roughly a 79% cost reduction.
Model routing is not just cost optimization - it is also latency optimization (smaller models are faster) and reliability optimization (if one provider goes down, route to another). Production AI systems rarely use a single model for everything.
What model routing is
Model routing is a system that classifies incoming requests and directs them to the most appropriate model based on task complexity, required capabilities, cost constraints, and latency requirements.
graph TD
    REQ["Incoming Request"] --> CLASSIFY["Router (classify complexity)"]
    CLASSIFY -->|"simple"| SMALL["Small Model (Haiku, GPT-3.5) $0.25/M tokens 200ms TTFT"]
    CLASSIFY -->|"moderate"| MED["Medium Model (Sonnet, GPT-4o-mini) $3/M tokens 400ms TTFT"]
    CLASSIFY -->|"complex"| LARGE["Large Model (Opus, GPT-4) $15/M tokens 800ms TTFT"]
    SMALL -->|"confidence < threshold"| MED
    MED -->|"confidence < threshold"| LARGE
    style CLASSIFY fill:#EEEDFE,stroke:#534AB7,color:#3C3489
    style SMALL fill:#E1F5EE,stroke:#0F6E56,color:#085041
    style MED fill:#FAEEDA,stroke:#854F0B,color:#633806
    style LARGE fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
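The selection logic in the diagram can be sketched as a small lookup: pick the cheapest tier that is capable enough and fast enough. This is a minimal sketch, not a definitive implementation. The prices and TTFT figures are copied from the diagram, while `ModelTier`, `pick_tier`, and the numeric `level` field are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelTier:
    name: str
    level: int                # 1 = simple, 2 = moderate, 3 = complex
    cost_per_m_tokens: float  # USD per million tokens
    ttft_ms: int              # typical time to first token


# Figures taken from the diagram; tier names are placeholders.
TIERS = [
    ModelTier("small", 1, 0.25, 200),
    ModelTier("medium", 2, 3.00, 400),
    ModelTier("large", 3, 15.00, 800),
]


def pick_tier(required_level: int, max_ttft_ms: Optional[int] = None) -> ModelTier:
    """Return the cheapest tier that meets the capability and latency constraints."""
    candidates = [
        tier for tier in TIERS
        if tier.level >= required_level
        and (max_ttft_ms is None or tier.ttft_ms <= max_ttft_ms)
    ]
    if not candidates:
        raise ValueError("no tier satisfies the constraints")
    return min(candidates, key=lambda tier: tier.cost_per_m_tokens)
```

When no tier fits (for example, a complex task with a 500 ms budget), the sketch raises instead of silently picking a tier, so the caller has to decide which constraint to relax.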
Routing strategies
Rule-based routing
Simple heuristics based on request properties:
def route_request(request):
    # Short classification queries → small model
    if len(request.message) < 50 and request.task_type == "classification":
        return "claude-haiku"
    # Code generation and summarization → medium model
    if request.task_type in ["code_generation", "summarization"]:
        return "gpt-4o-mini"
    # Multi-step reasoning, complex analysis → large model
    if request.task_type in ["analysis", "planning", "multi_step"]:
        return "gpt-4o"
    # Default: anything unrecognized goes to the medium model
    return "gpt-4o-mini"
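One way to keep such heuristics maintainable is to express them as an ordered table of rules rather than a chain of `if` statements. The sketch below mirrors the rules above. The `Request` dataclass, `RULES`, and `route_by_rules` are hypothetical names, and the request shape is an assumption about what your application passes in.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Request:
    """Hypothetical request shape: the raw message plus a pre-assigned task type."""
    message: str
    task_type: str


Rule = Tuple[Callable[[Request], bool], str]

# Rules are checked in order; the first match wins.
RULES: List[Rule] = [
    (lambda r: len(r.message) < 50 and r.task_type == "classification", "claude-haiku"),
    (lambda r: r.task_type in {"code_generation", "summarization"}, "gpt-4o-mini"),
    (lambda r: r.task_type in {"analysis", "planning", "multi_step"}, "gpt-4o"),
]
DEFAULT_MODEL = "gpt-4o-mini"


def route_by_rules(request: Request) -> str:
    """Return the model name for the first matching rule, or the default."""
    for matches, model in RULES:
        if matches(request):
            return model
    return DEFAULT_MODEL


if __name__ == "__main__":
    print(route_by_rules(Request("Is this email spam?", "classification")))
    print(route_by_rules(Request("Plan the database migration", "planning")))
```

Adding or reordering a rule then becomes a data change instead of a logic change, which makes the routing policy easier to review.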
ML-based routing
Train a classifier on (request, best_model) pairs:
# Classifier trained on: "which model produces acceptable quality for this input?"
# load_classifier and extract_features are placeholders for your own tooling.
router_model = load_classifier("model_router_v2")

def smart_route(request):
    features = extract_features(request)  # length, complexity signals, task type
    prediction = router_model.predict(features)
    return prediction.model_name  # The cheapest model predicted to succeed
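The (request, best_model) training pairs have to come from somewhere. A plausible approach, sketched below, is to score every candidate model on each eval query and label the query with the cheapest model that clears a quality bar. The model names, the 0.8 bar, and the score format are assumptions for illustration, not part of any particular router product.

```python
from typing import Dict, List, Tuple

# Candidate models ordered cheapest first. Names are placeholders.
MODELS_BY_COST = ["small", "medium", "large"]


def label_best_model(scores: Dict[str, float], min_quality: float = 0.8) -> str:
    """Return the cheapest model whose eval score meets the bar.

    Falls back to the most capable model when none of them qualifies.
    """
    for model in MODELS_BY_COST:
        if scores.get(model, 0.0) >= min_quality:
            return model
    return MODELS_BY_COST[-1]


def build_training_pairs(
    eval_results: List[Tuple[str, Dict[str, float]]],
    min_quality: float = 0.8,
) -> List[Tuple[str, str]]:
    """Turn (query, per-model scores) rows into (query, best_model) pairs."""
    return [
        (query, label_best_model(scores, min_quality))
        for query, scores in eval_results
    ]
```

The resulting pairs can then be fed to whatever classifier you prefer. The labeling step is what encodes "cheapest model that will succeed".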
Cascade routing (progressive fallback)
Start with the cheapest model. If it fails or returns low confidence, escalate:
async def cascade_route(request):
    # Try small model first
    response = await small_model.generate(request)
    if response.confidence > 0.8:
        return response
    # Escalate to medium
    response = await medium_model.generate(request)
    if response.confidence > 0.7:
        return response
    # Last resort: large model
    return await large_model.generate(request)
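The cascade above can be generalized to any number of tiers and exercised with stub models. This is a sketch under stated assumptions: `Response`, `make_stub`, and `cascade` are hypothetical names, and the `confidence` field is carried over from the snippet above. Provider responses generally do not include such a field, so in practice it would need to be derived, for example from token log-probabilities or a separate grader.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, List, Tuple


@dataclass
class Response:
    text: str
    confidence: float  # assumed to be computed elsewhere
    model: str


Generate = Callable[[str], Awaitable[Response]]


def make_stub(model: str, confidence: float) -> Generate:
    """Build a fake model that always answers with a fixed confidence."""
    async def generate(prompt: str) -> Response:
        return Response(text=f"[{model}] {prompt}", confidence=confidence, model=model)
    return generate


async def cascade(
    prompt: str,
    tiers: List[Tuple[Generate, float]],
    last_resort: Generate,
) -> Response:
    """Try each (model, min_confidence) tier in order; escalate on low confidence."""
    for generate, min_confidence in tiers:
        response = await generate(prompt)
        if response.confidence > min_confidence:
            return response
    # No cheaper tier was confident enough: accept the last model's answer as-is.
    return await last_resort(prompt)


if __name__ == "__main__":
    tiers = [(make_stub("small", 0.5), 0.8), (make_stub("medium", 0.75), 0.7)]
    print(asyncio.run(cascade("Explain CAP", tiers, make_stub("large", 0.9))).model)
```

Note the cost trade-off: every escalation pays for the cheaper attempts as well, so a cascade only saves money when most requests stop at the first tier.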
Fallback patterns
Provider failover
When one provider is down, route to another:
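class AllProvidersFailedError(Exception):
    """Raised when every configured provider has failed."""

# RateLimitError and ServiceUnavailable stand in for your SDK's exception types.
PROVIDERS = [
    {"name": "openai", "model": "gpt-4o", "priority": 1},
    {"name": "anthropic", "model": "claude-sonnet", "priority": 2},
    {"name": "google", "model": "gemini-pro", "priority": 3},
]

async def resilient_call(request):
    last_error = None
    # Try providers in priority order (lowest number first)
    for provider in sorted(PROVIDERS, key=lambda p: p["priority"]):
        try:
            return await call_provider(provider, request)
        except (RateLimitError, TimeoutError, ServiceUnavailable) as exc:
            last_error = exc  # remember why this provider failed, then try the next
    raise AllProvidersFailedError() from last_error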
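A failover loop only moves on when a call raises. A provider that hangs instead of failing would stall it, so one possible variation wraps each attempt in a timeout using `asyncio.wait_for`. The sketch below is self-contained and uses stand-in callables. The `call` key on each provider dict and the 10-second default are assumptions, not part of any provider SDK.

```python
import asyncio
from typing import Any, Dict, List


class AllProvidersFailedError(Exception):
    """Raised when every provider timed out or errored."""


async def call_with_failover(
    prompt: str,
    providers: List[Dict[str, Any]],
    timeout_s: float = 10.0,
) -> str:
    """Try providers in priority order, giving each at most timeout_s seconds."""
    errors = []
    for provider in sorted(providers, key=lambda p: p["priority"]):
        try:
            # provider["call"] is a hypothetical async callable taking the prompt.
            return await asyncio.wait_for(provider["call"](prompt), timeout=timeout_s)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            errors.append(f"{provider['name']}: {type(exc).__name__}")
    raise AllProvidersFailedError("; ".join(errors))
```

Collecting the per-provider error names in the final exception makes an all-providers outage easier to diagnose from logs.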
Quality-based fallback
If the primary model returns a response that fails quality checks:
async def quality_checked_call(request):
    response = await primary_model.generate(request)
    quality = evaluate_response(response)
    if quality.score < QUALITY_THRESHOLD:
        # Try with a more capable model
        response = await fallback_model.generate(request)
    return response
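What `evaluate_response` checks is application-specific. As a purely illustrative sketch, a cheap first pass might flag empty, very short, or refusal-style answers before any heavier grading runs. The markers, the 20-character minimum, and the scoring formula below are all assumptions, not a recommended rubric.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Quality:
    score: float        # 1.0 = no issues found, 0.0 = multiple issues
    reasons: List[str]  # which checks failed


# Illustrative refusal phrases; tune these for your own traffic.
REFUSAL_MARKERS = ("i cannot", "i can't", "as an ai")


def evaluate_response(text: str, min_chars: int = 20) -> Quality:
    """Apply cheap heuristic checks and subtract 0.5 per failed check."""
    reasons: List[str] = []
    stripped = text.strip()
    if not stripped:
        reasons.append("empty")
    elif len(stripped) < min_chars:
        reasons.append("too_short")
    if any(marker in stripped.lower() for marker in REFUSAL_MARKERS):
        reasons.append("refusal")
    return Quality(score=max(0.0, 1.0 - 0.5 * len(reasons)), reasons=reasons)
```

Heuristics like these catch only obvious failures. Subtle quality problems usually need task-specific checks or a grader model.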
Real-world routing systems
- OpenRouter - unified API that routes across 100+ models based on price, speed, and capability
- Martian - ML-based model router that learns which model works best for each query type
- LiteLLM - proxy server with load balancing, fallbacks, and routing across providers
- Anthropic/OpenAI - tiered model families (Haiku/Sonnet/Opus, mini/4o/4) designed for routing use cases
How to apply in practice
Classify your traffic first. Before building a router, analyze what your users actually ask. If 80% is genuinely complex, routing saves little. If 60% is simple, routing is transformative.
Measure quality at each tier. Run your eval suite against each candidate model. A “small” model that fails 20% of moderate queries may not be saving money - once you count retries, added latency, and user-visible errors, it can cost more than using the medium model upfront.
Monitor routing accuracy. Track: what percentage of requests routed to the small model needed escalation? If it is >10%, your routing threshold is too aggressive.
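The escalation metric can be computed directly from routing logs. The sketch below assumes each log entry records the model a request was first routed to and the model that produced the final answer. The tuple format and the `escalation_rate` name are hypothetical.

```python
from typing import Iterable, Tuple


def escalation_rate(log: Iterable[Tuple[str, str]], tier: str = "small") -> float:
    """Fraction of requests first routed to `tier` that ended on another model.

    Each log entry is (routed_model, final_model).
    """
    finals = [final for routed, final in log if routed == tier]
    if not finals:
        return 0.0
    escalated = sum(1 for final in finals if final != tier)
    return escalated / len(finals)


if __name__ == "__main__":
    sample = (
        [("small", "small")] * 17
        + [("small", "medium")] * 3
        + [("large", "large")] * 5
    )
    rate = escalation_rate(sample)
    # 3 of the 20 small-routed requests escalated, above the 10% guideline.
    print(f"{rate:.0%} escalated; too aggressive: {rate > 0.10}")
```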
Include latency in the routing decision. For real-time interactions (chatbots), a faster small model that takes 200ms beats a more capable model that takes 2000ms - even if the small model’s answers are slightly less detailed.
FAQ
Q: How do I decide the routing threshold?
Start with your eval: run all test queries against each model, measure quality. Set the threshold so that 95%+ of queries routed to the small model produce acceptable quality. This is usually a function of query complexity (length, reasoning steps required, domain specificity). Tune empirically by monitoring production quality at each tier.
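The threshold-setting procedure can be sketched as a scan over eval results sorted by complexity. The sketch assumes you already have some numeric complexity score per query (how to compute it is left open) and a boolean recording whether the small model's answer was acceptable. `pick_threshold` is a hypothetical helper.

```python
from typing import List, Optional, Tuple


def pick_threshold(
    samples: List[Tuple[float, bool]],
    target: float = 0.95,
) -> Optional[float]:
    """Return the largest complexity cutoff that keeps the small model on target.

    samples: (complexity_score, small_model_was_acceptable) per eval query.
    Queries with score <= cutoff would be routed to the small model. Returns
    None when no cutoff reaches the target pass rate.
    """
    ordered = sorted(samples, key=lambda s: s[0])
    best: Optional[float] = None
    passed = 0
    for count, (score, ok) in enumerate(ordered, start=1):
        passed += ok
        # Only evaluate a cutoff once all queries sharing this score are counted.
        if count < len(ordered) and ordered[count][0] == score:
            continue
        if passed / count >= target:
            best = score
    return best
```

Because the pass rate is cumulative, a few failures among easy queries can be outweighed by many successes later. It is worth inspecting the failures below the chosen cutoff rather than trusting the number alone.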
Q: Does model routing hurt user experience?
Not if done correctly. Users do not know (or care) which model serves them - they care about response quality and speed. If the small model provides adequate quality for simple queries AND responds faster, the user experience actually improves for those queries.
Interview questions
Q: Design a model routing system for a company spending $50K/month on LLM APIs. 40% of requests are simple FAQ responses, 35% are moderate complexity, 25% are complex multi-step reasoning. How do you reduce costs while maintaining quality?
Current state: all requests go to the expensive model (~$50K/month). The savings figures below assume spend splits roughly in proportion to request share. Target architecture:
(1) FAQ (40%): route to the smallest model (Haiku-tier). These are pattern-match responses - if eval shows 97%+ accuracy, switch immediately. Savings: ~$16K/month.
(2) Moderate (35%): route to a mid-tier model (Sonnet/4o-mini). Eval must show <3% quality degradation vs the expensive model. Savings: ~$8K/month.
(3) Complex (25%): keep on the expensive model. No change.
Total savings: ~$24K/month (48% reduction).
Implementation: start with rule-based routing (query length + task type classification), then graduate to an ML-based router trained on (query, quality_at_each_tier) data collected over 2 weeks. Monitor continuously with automated quality sampling.
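The arithmetic in this answer can be checked with a few lines. The sketch assumes spend is proportional to request share, which the question does not guarantee, since complex requests often consume more tokens. The per-tier savings are the figures quoted in the answer, and the function name is hypothetical.

```python
from typing import Dict, Tuple

MONTHLY_SPEND = 50_000  # USD, from the question

# share: fraction of requests; savings: USD/month quoted in the answer.
TIER_PLAN = {
    "faq": {"share": 0.40, "savings": 16_000},
    "moderate": {"share": 0.35, "savings": 8_000},
    "complex": {"share": 0.25, "savings": 0},
}


def summarize_savings(
    monthly_spend: float,
    plan: Dict[str, Dict[str, float]],
) -> Tuple[Dict[str, Dict[str, float]], float, float]:
    """Return per-tier before/after spend, total savings, and fractional reduction."""
    rows = {}
    for name, tier in plan.items():
        current = monthly_spend * tier["share"]  # assumes spend tracks request share
        rows[name] = {"current": current, "after": current - tier["savings"]}
    total = sum(tier["savings"] for tier in plan.values())
    return rows, total, total / monthly_spend


if __name__ == "__main__":
    rows, total, fraction = summarize_savings(MONTHLY_SPEND, TIER_PLAN)
    for name, row in rows.items():
        print(f"{name}: ${row['current']:,.0f} -> ${row['after']:,.0f}")
    print(f"total savings: ${total:,.0f} ({fraction:.0%})")
```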