Model Routing & Fallbacks: Using the Right Model for Each Task

Your AI application uses GPT-4 for every request. Monthly cost: $15,000. You analyze the traffic: 60% are simple questions (greetings, FAQ lookups, simple classification) that a $0.15/M-token model handles perfectly. 30% are moderate tasks (summarization, code review) where Claude Haiku works fine. Only 10% genuinely need GPT-4’s reasoning power (complex analysis, novel problem-solving). With routing, you send each request to the cheapest model that can handle it. Monthly cost drops to $3,200. Same quality where it matters. 78% cost reduction.

Model routing is not just cost optimization - it is also latency optimization (smaller models are faster) and reliability optimization (if one provider goes down, route to another). Production AI systems rarely use a single model for everything.

What model routing is

Model routing is a system that classifies incoming requests and directs them to the most appropriate model based on task complexity, required capabilities, cost constraints, and latency requirements.

graph TD
  REQ["Incoming Request"] --> CLASSIFY["Router
(classify complexity)"]
  CLASSIFY -->|"simple"| SMALL["Small Model
(Haiku, GPT-3.5)
$0.25/M tokens
200ms TTFT"]
  CLASSIFY -->|"moderate"| MED["Medium Model
(Sonnet, GPT-4o-mini)
$3/M tokens
400ms TTFT"]
  CLASSIFY -->|"complex"| LARGE["Large Model
(Opus, GPT-4)
$15/M tokens
800ms TTFT"]
  SMALL -->|"confidence < threshold"| MED
  MED -->|"confidence < threshold"| LARGE

  style CLASSIFY fill:#EEEDFE,stroke:#534AB7,color:#3C3489
  style SMALL fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style MED fill:#FAEEDA,stroke:#854F0B,color:#633806
  style LARGE fill:#FCEBEB,stroke:#A32D2D,color:#791F1F

Routing strategies

Rule-based routing

Simple heuristics based on request properties:

def route_request(request):
    # Short, simple queries → small model
    if len(request.message) < 50 and request.task_type == "classification":
        return "claude-haiku"
    
    # Code generation → medium model
    if request.task_type in ["code_generation", "summarization"]:
        return "gpt-4o-mini"
    
    # Multi-step reasoning, complex analysis → large model
    if request.task_type in ["analysis", "planning", "multi_step"]:
        return "gpt-4o"
    
    # Default
    return "gpt-4o-mini"

ML-based routing

Train a classifier on (request, best_model) pairs:

# Classifier trained on: "which model produces acceptable quality for this input?"
router_model = load_classifier("model_router_v2")

def smart_route(request):
    features = extract_features(request)  # length, complexity signals, task type
    prediction = router_model.predict(features)
    return prediction.model_name  # Returns the cheapest model that will succeed

Cascade routing (progressive fallback)

Start with the cheapest model. If it fails or returns low confidence, escalate:

async def cascade_route(request):
    # Try small model first
    response = await small_model.generate(request)
    if response.confidence > 0.8:
        return response
    
    # Escalate to medium
    response = await medium_model.generate(request)
    if response.confidence > 0.7:
        return response
    
    # Last resort: large model
    return await large_model.generate(request)

Fallback patterns

Provider failover

When one provider is down, route to another:

PROVIDERS = [
    {"name": "openai", "model": "gpt-4o", "priority": 1},
    {"name": "anthropic", "model": "claude-sonnet", "priority": 2},
    {"name": "google", "model": "gemini-pro", "priority": 3},
]

async def resilient_call(request):
    for provider in sorted(PROVIDERS, key=lambda p: p["priority"]):
        try:
            return await call_provider(provider, request)
        except (RateLimitError, TimeoutError, ServiceUnavailable):
            continue
    raise AllProvidersFailedError()

Quality-based fallback

If the primary model returns a response that fails quality checks:

async def quality_checked_call(request):
    response = await primary_model.generate(request)
    
    quality = evaluate_response(response)
    if quality.score < QUALITY_THRESHOLD:
        # Try with a more capable model
        response = await fallback_model.generate(request)
    
    return response

Real-world routing systems

OpenRouter - unified API that routes across 100+ models based on price, speed, and capability
Martian - ML-based model router that learns which model works best for each query type
LiteLLM - proxy server with load balancing, fallbacks, and routing across providers
Anthropic/OpenAI - tiered model families (Haiku/Sonnet/Opus, mini/4o/4) designed for routing use cases

How to apply in practice

Classify your traffic first. Before building a router, analyze what your users actually ask. If 80% is genuinely complex, routing saves little. If 60% is simple, routing is transformative.

Measure quality at each tier. Run your eval suite against each candidate model. A “small” model that fails 20% of moderate queries is not saving money - the retries and errors cost more than using the medium model upfront.

Monitor routing accuracy. Track: what percentage of requests routed to the small model needed escalation? If it is >10%, your routing threshold is too aggressive.

Include latency in the routing decision. For real-time interactions (chatbots), a faster small model that takes 200ms beats a more capable model that takes 2000ms - even if the small model’s answers are slightly less detailed.

FAQ

Q: How do I decide the routing threshold?

Start with your eval: run all test queries against each model, measure quality. Set the threshold so that 95%+ of queries routed to the small model produce acceptable quality. This is usually a function of query complexity (length, reasoning steps required, domain specificity). Tune empirically by monitoring production quality at each tier.

Q: Does model routing hurt user experience?

Not if done correctly. Users do not know (or care) which model serves them - they care about response quality and speed. If the small model provides adequate quality for simple queries AND responds faster, the user experience actually improves for those queries.

Interview questions

Q: Design a model routing system for a company spending $50K/month on LLM APIs. 40% of requests are simple FAQ responses, 35% are moderate complexity, 25% are complex multi-step reasoning. How do you reduce costs while maintaining quality?

Current state: all requests go to expensive model (~$50K). Target architecture: (1) FAQ (40%): route to smallest model (Haiku-tier). These are pattern-match responses - if eval shows 97%+ accuracy, switch immediately. Savings: ~$16K/month. (2) Moderate (35%): route to mid-tier model (Sonnet/4o-mini). Eval must show <3% quality degradation vs the expensive model. Savings: ~$8K/month. (3) Complex (25%): keep on expensive model. No change. Total savings: ~$24K/month (48% reduction). Implementation: start with rule-based routing (query length + task type classification), graduate to ML-based router trained on (query, quality_at_each_tier) data collected over 2 weeks. Monitor continuously with automated quality sampling.