Deployment Patterns for AI: Getting LLM Applications to Production


You ship your AI feature behind a feature flag and turn it on for all users at once. Three hours later, you discover the new system prompt has a subtle bug: it gives confidently wrong answers for a specific category of questions. By then, 10,000 users have seen bad responses. You revert, but the damage to trust is done.

Traditional software deployment patterns (blue-green, canary, feature flags) apply to AI systems, but AI adds unique challenges: the same code produces different outputs depending on model version, prompt content, retrieved context, and even time of day. You need deployment patterns that account for the non-deterministic, evolving nature of AI systems.

What makes AI deployment different

graph TD
  subgraph traditional["Traditional Software Deploy"]
      T1["Code change → Test → Deploy"]
      T2["Behavior is deterministic"]
      T3["Rollback = revert code"]
  end
  subgraph ai["AI Application Deploy"]
      A1["Code + Prompt + Model + Data → Test → Deploy"]
      A2["Behavior is non-deterministic"]
      A3["Rollback = revert code + prompt + model version + context"]
      A4["Silent degradation (no errors, just worse quality)"]
  end

  style T1 fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style A1 fill:#FAEEDA,stroke:#854F0B,color:#633806
  style A4 fill:#FCEBEB,stroke:#A32D2D,color:#791F1F

Deployment components for AI

Unlike traditional apps where you deploy “code,” AI applications have multiple independently versioned components:

  1. Application code - API routes, middleware, orchestration logic
  2. System prompts - behavioral configuration (version-controlled like code)
  3. Model version - the underlying LLM (provider-managed or self-hosted)
  4. Retrieval index - vector database content and embeddings
  5. Guardrails and filters - safety classifiers and rules
  6. Evaluation dataset - the test suite that gates deployment

Each can change independently and each change can affect output quality.
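
One way to keep these components visible at deploy time is to record them together in a single manifest. The sketch below is illustrative only: the `DeploymentManifest` class, its field names, and the example version strings are assumptions, not part of any particular tool.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class DeploymentManifest:
    """Hypothetical record of every versioned component in one release."""
    app_commit: str            # application code (git SHA)
    prompt_version: str        # system prompt version
    model_version: str         # pinned LLM version
    index_version: str         # retrieval index version
    guardrails_version: str    # safety classifiers and rules
    eval_dataset_version: str  # test suite that gated this release

def changed_components(old: DeploymentManifest, new: DeploymentManifest) -> list[str]:
    """Return the names of components that differ between two releases."""
    old_d, new_d = asdict(old), asdict(new)
    return [name for name in old_d if old_d[name] != new_d[name]]

current = DeploymentManifest("a1b2c3d", "v1", "gpt-4o-2024-08-06", "v3", "g2", "e5")
candidate = DeploymentManifest("a1b2c3d", "v2", "gpt-4o-2024-08-06", "v3", "g2", "e5")
print(changed_components(current, candidate))  # ['prompt_version']
```

Comparing manifests makes it obvious when a release that looks like "just a prompt tweak" also changes another component.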

Deployment patterns

Pattern 1: Canary deployment for prompts

Roll out prompt changes to a small percentage of traffic, monitor quality, then expand:

def get_system_prompt(user_id):
    if is_in_canary(user_id, canary_percent=5):
        return PROMPT_V2  # New prompt (5% of users)
    return PROMPT_V1  # Current prompt (95% of users)

Monitor both cohorts against your eval metrics. If V2 performs as well as or better than V1 after 24 hours, expand to 25%, then 50%, then 100%.
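
The snippet above leaves `is_in_canary` undefined. A minimal sketch of one common approach, hash-based bucketing, is shown here; the `salt` parameter and the bucket resolution are assumptions chosen for illustration.

```python
import hashlib

def is_in_canary(user_id: str, canary_percent: float, salt: str = "prompt-v2") -> bool:
    """Deterministically place a user in the canary cohort.

    Hashing (salt + user_id) gives each user a stable bucket in [0, 100),
    so the same user sees the same prompt for as long as the percentage is fixed.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000 / 100  # 0.00 .. 99.99
    return bucket < canary_percent

# Roughly 5% of a sample population should land in the canary.
sample = [f"user-{i}" for i in range(10_000)]
share = sum(is_in_canary(u, 5) for u in sample) / len(sample)
print(f"{share:.1%} of sample users in canary")
```

Because a user's bucket never changes, raising the percentage from 5 to 25 keeps every existing canary user in the canary, so nobody flips back and forth between prompts during the rollout.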

Pattern 2: Shadow deployment

Run the new version in parallel without showing results to users:

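import asyncio

# Hold references so pending shadow tasks are not garbage-collected mid-run
_background_tasks = set()

async def handle_request(query):
    # Production response (what user sees)
    response = await production_pipeline.generate(query)

    # Shadow response (logged but not shown); runs in the background
    task = asyncio.create_task(shadow_pipeline.generate_and_log(query))
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)

    return response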

Compare shadow outputs against production using your eval suite. When shadow matches or exceeds production quality, promote it.
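
As a rough sketch of that promotion check, assume each request has been scored by the eval suite for both pipelines, with higher scores being better. The function name, the `min_samples` default, and the `tolerance` parameter are illustrative assumptions.

```python
from statistics import mean

def should_promote(prod_scores: list[float], shadow_scores: list[float],
                   min_samples: int = 100, tolerance: float = 0.0) -> bool:
    """Promote only if shadow's mean eval score is at least production's.

    Scores are assumed to be paired per request; higher is better.
    """
    if len(prod_scores) != len(shadow_scores):
        raise ValueError("scores must be paired per request")
    if len(prod_scores) < min_samples:
        return False  # not enough evidence yet
    return mean(shadow_scores) >= mean(prod_scores) - tolerance

print(should_promote([0.8] * 100, [0.9] * 100))  # True
print(should_promote([0.8] * 100, [0.7] * 100))  # False
```

A mean comparison is the simplest possible rule; a real gate would likely also look at per-category scores and at the worst individual regressions.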

Pattern 3: Model version pinning

Never deploy “latest” - always pin to a specific model version:

MODEL_CONFIG = {
    "production": "gpt-4o-2024-08-06",    # Pinned, tested
    "staging": "gpt-4o-2024-11-20",        # Testing new version
    "canary": "gpt-4o-2024-11-20",         # Small % of production
}

When a provider releases a new model version, test it in staging, then in canary, before promoting it to production.
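
A small check can catch a floating alias before it reaches production. This sketch assumes the naming convention used in the config above, where a pinned name ends in a `YYYY-MM-DD` snapshot date; other providers use different schemes, so the pattern would need adjusting.

```python
import re

# Assumed convention: a pinned model name ends in a YYYY-MM-DD snapshot date.
PINNED_SUFFIX = re.compile(r"-\d{4}-\d{2}-\d{2}$")

def unpinned_models(config: dict[str, str]) -> list[str]:
    """Return the environments whose model name lacks a dated snapshot suffix."""
    return [env for env, model in config.items() if not PINNED_SUFFIX.search(model)]

config = {
    "production": "gpt-4o-2024-08-06",
    "staging": "gpt-4o",  # floating alias: should be flagged
}
print(unpinned_models(config))  # ['staging']
```

Running a check like this in CI turns "never deploy latest" from a convention into an enforced rule.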

Pattern 4: Prompt-as-code

Version control prompts alongside application code:

repo/
  src/
    app.py
  prompts/
    v1/
      system.md
      few_shot_examples.json
    v2/
      system.md
      few_shot_examples.json
  evals/
    test_cases.json

Prompt changes go through the same PR/review/CI process as code changes. The CI pipeline runs evals against the new prompt before merge.
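
With that layout, the application can load a prompt by version at startup. The following is a sketch under that assumption; `load_prompt` is a hypothetical helper, and the demo builds a throwaway directory rather than reading a real repository.

```python
import json
import tempfile
from pathlib import Path

def load_prompt(prompts_dir: Path, version: str) -> dict:
    """Load the system prompt and few-shot examples for one prompt version."""
    version_dir = prompts_dir / version
    return {
        "version": version,
        "system": (version_dir / "system.md").read_text(encoding="utf-8"),
        "few_shot": json.loads(
            (version_dir / "few_shot_examples.json").read_text(encoding="utf-8")
        ),
    }

# Demo: build a temporary prompts/v2/ directory and load it.
with tempfile.TemporaryDirectory() as tmp:
    v2 = Path(tmp) / "v2"
    v2.mkdir()
    (v2 / "system.md").write_text("You are a helpful assistant.", encoding="utf-8")
    (v2 / "few_shot_examples.json").write_text("[]", encoding="utf-8")
    prompt = load_prompt(Path(tmp), "v2")

print(prompt["version"], len(prompt["few_shot"]))  # v2 0
```

Keeping each version in its own directory means a rollback is a one-word change to the version string, and the diff in a PR shows exactly which prompt text changed.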

Pattern 5: Retrieval index versioning

When you update your knowledge base, keep the old index available for rollback:

INDEX_VERSIONS = {
    "v3": {"created": "2024-03-15", "status": "production"},
    "v2": {"created": "2024-02-01", "status": "rollback_ready"},
    "v1": {"created": "2024-01-01", "status": "archived"},
}

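# Switch production to a new index
def promote_index(version):
    # Demote the current production index so exactly one index is live
    for meta in INDEX_VERSIONS.values():
        if meta["status"] == "production":
            meta["status"] = "rollback_ready"
    INDEX_VERSIONS[version]["status"] = "production"
    # Keep the previous version available for 7 days before archiving it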

The deployment pipeline

Code/Prompt Change
    → Unit Tests (fast, deterministic)
    → Eval Suite (LLM-based quality checks, 5-10 min)
    → Staging Deploy (full pipeline, internal users)
    → Canary Deploy (5% of production traffic)
    → Monitor for 24h (quality metrics, error rates, user satisfaction)
    → Full Rollout (if metrics hold)
    → Monitor for 7d (long-term quality tracking)
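
The pipeline is essentially a sequence of gates where each stage must pass before the next one runs. A minimal sketch, with stub gates standing in for the real test runner, eval suite, and metrics checks (all names here are hypothetical):

```python
from typing import Callable

Stage = tuple[str, Callable[[], bool]]

def run_pipeline(stages: list[Stage]) -> tuple[bool, list[str]]:
    """Run stages in order and stop at the first failing gate.

    Returns (succeeded, names of the stages that passed).
    """
    passed: list[str] = []
    for name, gate in stages:
        if not gate():
            return False, passed
        passed.append(name)
    return True, passed

# Stub gates; real ones would call the test runner, eval suite, and metrics store.
stages = [
    ("unit_tests", lambda: True),
    ("eval_suite", lambda: True),
    ("staging", lambda: True),
    ("canary_5_percent", lambda: False),  # simulate a canary regression
    ("full_rollout", lambda: True),
]
ok, passed = run_pipeline(stages)
print(ok, passed)  # False ['unit_tests', 'eval_suite', 'staging']
```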

Rollback considerations

AI rollbacks are more complex than git revert:

  • Prompt rollback: Revert to previous prompt version (fast, simple)
  • Model rollback: Switch back to pinned previous model version (fast if version is still available from provider)
  • Index rollback: Point retrieval to previous index version (fast if maintained)
  • Full rollback: All of the above simultaneously (complex, test first)

Always maintain the ability to roll back each component independently.
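
One way to picture independent rollback is a separate version history per component. The `ComponentHistory` class below is a hypothetical in-memory sketch; a real system would persist this state and trigger the actual switch in the serving layer.

```python
class ComponentHistory:
    """Tracks deployed versions of one component so it can be rolled back alone."""

    def __init__(self, initial: str):
        self._versions = [initial]

    @property
    def current(self) -> str:
        return self._versions[-1]

    def deploy(self, version: str) -> None:
        self._versions.append(version)

    def rollback(self) -> str:
        if len(self._versions) < 2:
            raise RuntimeError("no previous version to roll back to")
        self._versions.pop()
        return self.current

components = {
    "prompt": ComponentHistory("v1"),
    "model": ComponentHistory("gpt-4o-2024-08-06"),
    "index": ComponentHistory("v2"),
}
components["prompt"].deploy("v2")
components["index"].deploy("v3")

components["prompt"].rollback()  # revert only the prompt
print({name: h.current for name, h in components.items()})
# {'prompt': 'v1', 'model': 'gpt-4o-2024-08-06', 'index': 'v3'}
```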

Real-world deployment infrastructure

  • LaunchDarkly - feature flags extended with AI-specific targeting (model routing, prompt variants)
  • Weights & Biases - experiment tracking and model registry for ML deployments
  • MLflow - model versioning and deployment management
  • Humanloop - prompt management with version control, evaluation, and deployment
  • Braintrust - prompt versioning + eval in one platform
  • Vercel AI SDK - TypeScript toolkit for streaming LLM responses through a unified API across model providers

How to apply in practice

Never deploy prompt changes directly to 100% of users. Always canary. A typo in your system prompt can break the entire application, and AI failures are often silent (wrong answers, not errors).

Gate deployments on eval results. If your eval suite regresses by >2% on any category, block the deployment automatically. Treat eval results like test results in CI.
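
A gate like this can be a short function in CI. The sketch below assumes per-category pass rates in the range 0 to 1 and reads ">2%" as an absolute drop of 2 percentage points, which is one possible interpretation; the category names and scores are made up for illustration.

```python
def regressed_categories(baseline: dict[str, float], candidate: dict[str, float],
                         max_drop: float = 0.02) -> list[str]:
    """Return categories whose score dropped by more than max_drop.

    A category missing from the candidate results counts as a regression.
    """
    failures = []
    for category, base_score in baseline.items():
        cand_score = candidate.get(category)
        if cand_score is None or base_score - cand_score > max_drop:
            failures.append(category)
    return failures

baseline = {"billing": 0.94, "refunds": 0.91, "shipping": 0.88}
candidate = {"billing": 0.95, "refunds": 0.86, "shipping": 0.87}
failures = regressed_categories(baseline, candidate)
print(failures)  # ['refunds']
```

In CI, a non-empty result would fail the job (for example by exiting with a non-zero status), which blocks the merge the same way a failing unit test does.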

Monitor quality continuously, not just at deploy time. AI quality can degrade over time (model drift, stale retrieval context, changing user patterns) without any deployment happening. Set up ongoing quality monitoring with alerting.
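
A minimal sketch of such monitoring is a rolling-window check over sampled quality scores. The `QualityMonitor` class, window size, and threshold are illustrative assumptions; where the scores come from (an LLM judge, user feedback, or a heuristic) is left open.

```python
from collections import deque

class QualityMonitor:
    """Alerts when the rolling mean of a quality score falls below a threshold."""

    def __init__(self, window: int, threshold: float):
        self.scores = deque(maxlen=window)
        self.threshold = threshold

    def record(self, score: float) -> bool:
        """Add a score; return True if an alert should fire."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # wait for a full window before alerting
        return sum(self.scores) / len(self.scores) < self.threshold

monitor = QualityMonitor(window=3, threshold=0.8)
alerts = [monitor.record(s) for s in [0.9, 0.9, 0.9, 0.7, 0.5]]
print(alerts)  # [False, False, False, False, True]
```

The alert fires without any deployment event, which is the point: the trigger is the observed quality, not the release calendar.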

Keep old versions running for comparison. When users report quality changes, you need to compare current vs previous version on the same inputs. Maintain shadow instances of the previous version for 7-14 days after any change.

Document what changed for every deployment. “Deployed at 3pm” is not enough. “Deployed prompt v7 (changed tone instructions) + updated retrieval index (added 50 new docs) at 3pm” enables meaningful incident analysis.
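
A structured record makes this easy to search later. The sketch below writes one JSON line per deployment; the field names and the example date are assumptions for illustration.

```python
import json
from datetime import datetime, timezone

def deployment_record(changes: dict[str, dict], deployed_at: datetime) -> str:
    """Serialize what changed in a deploy as one JSON line for the deploy log."""
    record = {
        "deployed_at": deployed_at.isoformat(),
        "changes": changes,
    }
    return json.dumps(record, sort_keys=True)

entry = deployment_record(
    {
        "prompt": {"version": "v7", "note": "changed tone instructions"},
        "retrieval_index": {"note": "added 50 new docs"},
    },
    deployed_at=datetime(2024, 3, 15, 15, 0, tzinfo=timezone.utc),  # example timestamp
)
print(entry)
```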

FAQ

Q: How do I handle the case where the LLM provider silently updates their model?

Pin to dated model versions (e.g., gpt-4o-2024-08-06 not gpt-4o). Run your eval suite weekly against your pinned version to detect any changes. When a new version is available, treat it like any other model change: eval in staging → canary → production. Set up alerts for unexpected quality changes that might indicate a silent update on the provider side.

Q: Traditional CI runs in seconds. AI evals take minutes. How do I keep deploy velocity?

Use a tiered eval strategy: (1) Fast smoke tests (10 critical cases, 30 seconds) run on every commit. (2) The full eval suite (200+ cases, 5-10 minutes) runs on PR merge to main. (3) Extended evaluation (adversarial tests, edge cases, 30+ minutes) runs nightly. Gate deploys on tiers 1 and 2. Use tier 3 for weekly quality reports and regression detection.
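
The tiers can be expressed as a small piece of configuration plus a gate. This is a sketch only: the tier names, trigger labels, and helper functions are hypothetical.

```python
EVAL_TIERS = {
    "smoke": {"trigger": "commit", "gates_deploy": True},         # ~10 critical cases
    "full": {"trigger": "merge_to_main", "gates_deploy": True},   # 200+ cases
    "extended": {"trigger": "nightly", "gates_deploy": False},    # adversarial, edge cases
}

def tiers_for(event: str) -> list[str]:
    """Return the eval tiers that should run for a given trigger event."""
    return [name for name, tier in EVAL_TIERS.items() if tier["trigger"] == event]

def deploy_allowed(results: dict[str, bool]) -> bool:
    """Deploy only if every gating tier has run and passed."""
    gating = [name for name, tier in EVAL_TIERS.items() if tier["gates_deploy"]]
    return all(results.get(name, False) for name in gating)

print(tiers_for("commit"))                             # ['smoke']
print(deploy_allowed({"smoke": True, "full": True}))   # True
print(deploy_allowed({"smoke": True}))                 # False
```

A failing extended run does not block a deploy here; it feeds the quality reports instead.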

Interview questions

Q: Design the deployment pipeline for an AI-powered search engine that handles 1M queries/day. A bad deployment could surface irrelevant or harmful results for millions of users.

Pipeline:

  1. All changes (code, prompts, index, model) go through PR review.
  2. Automated eval: 500 test queries covering relevance, safety, and freshness. Must pass >95% relevance, 0% harmful content.
  3. Staging: full pipeline with synthetic traffic for 24h. Monitor relevance metrics, latency, and error rates.
  4. Canary: 1% of production traffic for 48h. Compare against production baseline - relevance must not drop >1%, latency must not increase >10%.
  5. Gradual rollout: 1% → 5% → 25% → 50% → 100% over 5 days with automated rollback triggers.
  6. Rollback triggers: relevance drops >2%, harmful content detected, latency p99 exceeds SLA, error rate >0.1%.
  7. Post-deploy: monitor for 14 days. Keep previous version warm for instant rollback.

The key insight: at 1M queries/day, even 1% of traffic during canary is 10,000 queries per day - enough to detect quality issues statistically.
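
To make the statistical claim concrete, here is a rough worked example using a two-proportion z-test. The relevance rates (95.0% for the baseline, 94.0% for the canary) are hypothetical numbers chosen to match a 1-point drop; a production system might prefer a sequential test or per-segment analysis.

```python
import math

def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """z statistic for the difference between two proportions (pooled variance)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Hypothetical day of traffic: 990,000 baseline queries at 95.0% relevant,
# 10,000 canary queries at 94.0% relevant.
z = two_proportion_z(940_500, 990_000, 9_400, 10_000)
print(f"z = {z:.2f}")
```

Under these assumptions the z statistic comes out well above 1.96, the usual threshold for significance at the 5% level, so a 1-point relevance drop would be detectable within a single day of canary traffic.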