Blue-Green and Canary Deploys: Shipping Without Downtime
You deploy a new version of your API. The deployment takes 5 minutes. During those 5 minutes, some servers run the old version and some run the new version. A request hits a server running the new version. It calls another service that expects the old API format. It fails. Your users see errors.
This is the deployment problem. Blue-green and canary deployments solve it in different ways: blue-green by switching all traffic at once, canary by gradually shifting traffic while monitoring for problems.
Blue-green deployment
Blue-green deployment maintains two identical production environments: blue (current) and green (new). Traffic goes to blue. You deploy the new version to green. You test green. You switch all traffic from blue to green. If something goes wrong, you switch back to blue.
How it works:
- Blue environment runs the current version (serving all traffic)
- Deploy new version to green environment (no traffic)
- Run smoke tests on green
- Switch traffic from blue to green (load balancer update, or a DNS change, which only takes full effect as cached records expire)
- Monitor green for issues
- If issues: switch back to blue (instant rollback)
- If healthy: decommission blue (or keep as next deployment target)
graph TB
    subgraph before["Before Deployment"]
        LB1["Load Balancer 100% traffic"] --> BLUE1["Blue v1.0 All traffic"]
        GREEN1["Green v2.0 No traffic"]
    end
    subgraph switch["Traffic Switch"]
        LB2["Load Balancer"] -->|"switch"| GREEN2["Green v2.0 All traffic"]
        BLUE2["Blue v1.0 Standby (instant rollback)"]
    end
    style BLUE1 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
    style GREEN1 fill:#F1EFE8,stroke:#888780,color:#444441
    style GREEN2 fill:#E1F5EE,stroke:#0F6E56,color:#085041
    style BLUE2 fill:#F1EFE8,stroke:#888780,color:#444441
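The switch-and-rollback flow can be sketched in a few lines of Python. This is a toy model, not a real load balancer integration: `Environment`, `BlueGreenRouter`, and the `smoke_test` callback are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Environment:
    name: str      # "blue" or "green"
    version: str   # version currently deployed there


class BlueGreenRouter:
    """Toy stand-in for a load balancer that sends all traffic to one environment."""

    def __init__(self, blue: Environment, green: Environment) -> None:
        self.active = blue   # serves 100% of traffic
        self.idle = green    # receives no traffic

    def deploy(self, version: str, smoke_test: Callable[[Environment], bool]) -> bool:
        """Deploy to the idle environment and switch only if smoke tests pass."""
        self.idle.version = version
        if not smoke_test(self.idle):
            return False     # the active environment was never touched
        self.switch()
        return True

    def switch(self) -> None:
        """Swap active and idle. Calling it a second time is the rollback."""
        self.active, self.idle = self.idle, self.active


router = BlueGreenRouter(Environment("blue", "v1.0"), Environment("green", "v1.0"))
switched = router.deploy("v2.0", smoke_test=lambda env: env.version == "v2.0")
router.switch()  # rollback: blue is still running v1.0
print(switched, router.active.name, router.active.version)
```

The point of the design is that rollback is the same operation as the switch, which is why it is fast: the old environment is still running and untouched.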
Pros:
- Zero downtime (traffic switch is instant)
- Instant rollback (switch back to blue)
- Full testing before traffic switch
- No mixed versions serving traffic simultaneously
Cons:
- Requires double the infrastructure (two full environments)
- Database migrations must be backward compatible (both versions use the same database)
- Stateful sessions must be handled (sessions held in memory on blue instances are lost when traffic switches to green)
Canary deployment
Canary deployment gradually shifts traffic from the old version to the new version. Start with 1% of traffic on the new version. Monitor for errors. If healthy, increase to 5%, then 25%, then 100%.
Named after the “canary in a coal mine” - a small group of users acts as the early warning system.
How it works:
- Deploy new version alongside old version
- Route 1% of traffic to new version
- Monitor error rate, latency, and business metrics
- If healthy: increase to 5%, 25%, 50%, 100%
- If issues: route 0% to new version (instant rollback)
graph LR
    subgraph canary["Canary Deployment Progression"]
        S1["Stage 1 1% canary 99% stable"]
        S2["Stage 2 5% canary 95% stable"]
        S3["Stage 3 25% canary 75% stable"]
        S4["Stage 4 100% canary 0% stable"]
        S1 -->|"healthy"| S2
        S2 -->|"healthy"| S3
        S3 -->|"healthy"| S4
        S2 -->|"issues detected"| ROLL["Rollback 0% canary"]
    end
    style S1 fill:#FAEEDA,stroke:#854F0B,color:#633806
    style S2 fill:#FAEEDA,stroke:#854F0B,color:#633806
    style S3 fill:#E1F5EE,stroke:#0F6E56,color:#085041
    style S4 fill:#E1F5EE,stroke:#0F6E56,color:#085041
    style ROLL fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
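The progression above is essentially a loop with an early exit. Here is a minimal sketch, assuming a hypothetical `is_healthy(percent)` hook that would, in a real system, compare canary and stable metrics after the canary has served that share of traffic for a while.

```python
from typing import Callable, List, Tuple

STAGES = [1, 5, 25, 100]  # percent of traffic on the canary, matching the stages above


def run_canary(stages: List[int],
               is_healthy: Callable[[int], bool]) -> Tuple[int, List[int]]:
    """Walk through the stages; return (final canary percent, stages attempted)."""
    attempted = []
    for percent in stages:
        attempted.append(percent)
        if not is_healthy(percent):
            return 0, attempted   # rollback: route 0% to the canary
    return stages[-1], attempted


healthy_run = run_canary(STAGES, lambda percent: True)
failed_run = run_canary(STAGES, lambda percent: percent < 5)  # fails at the 5% stage
print(healthy_run, failed_run)
```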
Pros:
- Limits blast radius (at the first stage, only about 1% of traffic is exposed to a bad deploy)
- Real production traffic validates the new version
- Gradual rollout gives time to detect subtle issues
- No double infrastructure cost (canary runs alongside stable)
Cons:
- Mixed versions serving traffic simultaneously (API compatibility required)
- Slower rollout (minutes to hours vs seconds for blue-green)
- More complex traffic routing
- Harder to test database migrations (both versions use the same database)
Rolling deployment
A simpler variant: replace instances one at a time. Take one instance out of rotation, deploy the new version, bring it back. Repeat for all instances.
Pros: Simple, no extra infrastructure.
Cons: Mixed versions during deployment, no instant rollback, slower than blue-green.
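A rough sketch of the loop, with the fleet modeled as a list of version strings. The `health_check` callback is a hypothetical stand-in for whatever readiness probe you use. Note what happens on failure: the instances already upgraded stay upgraded, which is the "no instant rollback" problem in miniature.

```python
from typing import Callable, List


def rolling_deploy(instances: List[str], new_version: str,
                   health_check: Callable[[int, str], bool]) -> List[List[str]]:
    """Upgrade instances in place, one at a time.

    Returns a snapshot of the fleet after each successful step and stops at
    the first failed health check.
    """
    snapshots = []
    for i in range(len(instances)):
        previous = instances[i]
        instances[i] = new_version        # out of rotation, upgraded, back in
        if not health_check(i, new_version):
            instances[i] = previous       # restore this one instance and stop
            break
        snapshots.append(list(instances))
    return snapshots


fleet = ["v1", "v1", "v1"]
steps = rolling_deploy(fleet, "v2", lambda i, version: True)
for step in steps:
    print(step)  # every snapshot except the last shows mixed versions
```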
Where it breaks or gets interesting
Database migrations
Database migrations are the hardest part of zero-downtime deployments. Both the old and new versions must work with the same database schema during the transition.
The expand-contract pattern:
- Expand: add new columns/tables (backward compatible, old version ignores them)
- Deploy new version (both versions work with the expanded schema)
- Contract: remove old columns/tables (after old version is fully replaced)
Never deploy a schema change that breaks the old version while the old version is still running.
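To make the pattern concrete, here is a small sketch using Python's built-in `sqlite3` module and an in-memory database. The table and column names (`users`, `email`, `contact_email`) are invented for the example, and the backfill step is an assumption added here: existing rows usually need their data copied into the new column before the old one can go.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")


def old_insert(email: str) -> None:
    """Old version: only knows about the `email` column."""
    conn.execute("INSERT INTO users (email) VALUES (?)", (email,))


def new_insert(email: str) -> None:
    """New version: writes both columns while old instances may still be running."""
    conn.execute("INSERT INTO users (email, contact_email) VALUES (?, ?)",
                 (email, email))


old_insert("a@example.com")

# Expand: add the new column as nullable, so old_insert keeps working.
conn.execute("ALTER TABLE users ADD COLUMN contact_email TEXT")
old_insert("b@example.com")   # old version, unaware of the new column
new_insert("c@example.com")   # new version, writing both columns

# Backfill rows written by the old version (assumed step, see above).
conn.execute("UPDATE users SET contact_email = email WHERE contact_email IS NULL")

# Contract: only once no old-version instance remains. A table rebuild is
# used here because it works on any SQLite version.
conn.executescript("""
    CREATE TABLE users_new (id INTEGER PRIMARY KEY, contact_email TEXT);
    INSERT INTO users_new (id, contact_email) SELECT id, contact_email FROM users;
    DROP TABLE users;
    ALTER TABLE users_new RENAME TO users;
""")

rows = conn.execute("SELECT id, contact_email FROM users ORDER BY id").fetchall()
print(rows)
```

Between the expand and contract steps, both `old_insert` and `new_insert` succeed against the same schema, which is exactly the window a blue-green or canary rollout needs.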
Session handling in blue-green
When you switch from blue to green, users whose session state lives in memory on blue instances lose their sessions. Solutions:
- Use a shared session store (Redis) that both blue and green can access
- Use stateless tokens (JWTs) that do not require a session store
- Accept brief session loss for non-critical applications
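The stateless-token option can be illustrated with a signed token built from the standard library. This is a simplified stand-in for a JWT, not a JWT implementation: there is no expiry or header, and `SECRET` is a hypothetical key that both blue and green would load from shared configuration.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional

SECRET = b"shared-signing-key"  # hypothetical; both environments load the same key


def issue_token(user_id: str) -> str:
    """Encode the session data and append an HMAC-SHA256 signature."""
    payload = base64.urlsafe_b64encode(json.dumps({"user_id": user_id}).encode())
    signature = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + signature


def verify_token(token: str) -> Optional[dict]:
    """Return the session data if the signature is valid, else None."""
    payload, _, signature = token.partition(".")
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))


token = issue_token("u42")       # issued by a blue instance
session = verify_token(token)    # verified by a green instance after the switch
print(session)
```

Because verification needs only the shared key and the token itself, it does not matter which environment issued the token.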
Feature flags as an alternative
Feature flags let you deploy code without activating it. Deploy the new feature to all servers, but keep it disabled. Enable it for 1% of users (canary), then gradually increase. Rollback is instant (disable the flag). No infrastructure changes needed.
Feature flags are more flexible than canary deployments but require more application code changes.
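A percentage rollout is commonly implemented by hashing the user into a stable bucket. The sketch below assumes that approach; `is_enabled` and the flag name are hypothetical, and real flag systems add targeting rules, overrides, and a way to change the percentage at runtime.

```python
import hashlib


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Place each user in a stable bucket 0-99; enable buckets below the percentage."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent


users = [f"user-{n}" for n in range(10_000)]
at_1 = {u for u in users if is_enabled("new-checkout", u, 1)}
at_25 = {u for u in users if is_enabled("new-checkout", u, 25)}
print(len(at_1), len(at_25))
```

Hashing instead of picking randomly per request means a given user gets a consistent experience, and raising the percentage only adds users: everyone enabled at 1% is still enabled at 25%.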
Canary analysis automation
Manual canary analysis (watching dashboards) is error-prone. Automate it: compare error rate, latency, and business metrics between canary and stable. If the canary is statistically worse, automatically roll back. Tools: Spinnaker (automated canary analysis), Argo Rollouts (Kubernetes), Flagger (Kubernetes).
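As a rough idea of what "statistically worse" can mean, here is a one-sided two-proportion z-test on error counts, using only the standard library. This is one simple choice of test made for illustration; it is not a description of how Spinnaker, Argo Rollouts, or Flagger work internally, and the function name and `alpha` default are assumptions.

```python
import math


def canary_is_worse(canary_errors: int, canary_total: int,
                    stable_errors: int, stable_total: int,
                    alpha: float = 0.05) -> bool:
    """One-sided two-proportion z-test: is the canary error rate higher than stable's?"""
    p_canary = canary_errors / canary_total
    p_stable = stable_errors / stable_total
    pooled = (canary_errors + stable_errors) / (canary_total + stable_total)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / canary_total + 1 / stable_total))
    if std_err == 0:
        return False  # identical all-success or all-failure counts: nothing to compare
    z = (p_canary - p_stable) / std_err
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z) if the rates were equal
    return p_value < alpha


# 0.6% vs 0.1% with plenty of traffic: flagged as a real regression.
print(canary_is_worse(60, 10_000, 100, 100_000))
# 1% vs 0.5%, but only 100 canary requests: too little data to tell.
print(canary_is_worse(1, 100, 5, 1_000))
```

The second call is the interesting one: the canary's error rate is double the stable rate, yet a single error in 100 requests is well within what chance alone would produce.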
Real-world systems
Netflix - Uses canary deployments extensively for production changes. Automated canary analysis compares metrics between canary and baseline. Spinnaker orchestrates the deployment pipeline.
Facebook - Uses a staged rollout: internal employees first, then 1% of users, then 10%, then 100%. Each stage has automated checks.
Google - Uses canary deployments with automated analysis. Changes are rolled out gradually across their global infrastructure.
AWS - CodeDeploy supports blue-green and canary deployments. ECS and Lambda support traffic shifting.
Kubernetes - Supports rolling updates natively. Argo Rollouts and Flagger add blue-green and canary support.
How to apply it in practice
Choosing a strategy
Use blue-green when:
- You need instant rollback
- You can afford double infrastructure
- Your deployment is infrequent (weekly, monthly)
- You need to test the full environment before switching
Use canary when:
- You deploy frequently (multiple times per day)
- You want to limit blast radius
- You have good monitoring to detect issues
- Your infrastructure cannot support double capacity
Use rolling when:
- Simplicity is more important than instant rollback
- You have many instances and can tolerate brief mixed versions
Monitoring during deployment
During a canary deployment, monitor:
- Error rate (canary vs stable)
- Latency (canary vs stable)
- Business metrics (conversion rate, order completion rate)
- Infrastructure metrics (CPU, memory)
Set automatic rollback triggers: if canary error rate is 2x the stable error rate, automatically roll back.
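A minimal sketch of that trigger follows. The 2x ratio comes from the rule above; the minimum request count and the baseline floor are assumptions added so the rule does not fire on a handful of requests or when the stable error rate is close to zero.

```python
def should_roll_back(canary_errors: int, canary_requests: int,
                     stable_errors: int, stable_requests: int,
                     max_ratio: float = 2.0,
                     min_requests: int = 500,
                     min_baseline_rate: float = 0.001) -> bool:
    """Roll back when the canary error rate reaches max_ratio times the stable rate."""
    if canary_requests < min_requests or stable_requests == 0:
        return False  # not enough data to judge yet
    canary_rate = canary_errors / canary_requests
    stable_rate = stable_errors / stable_requests
    # Floor the baseline so a near-zero stable rate does not make any error fatal.
    baseline = max(stable_rate, min_baseline_rate)
    return canary_rate >= max_ratio * baseline


print(should_roll_back(12, 1_000, 50, 10_000))  # 1.2% vs 0.5%
print(should_roll_back(8, 1_000, 50, 10_000))   # 0.8% vs 0.5%
```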
The deployment pipeline
A typical deployment pipeline:
- Build and test (CI)
- Deploy to staging (full environment)
- Run integration tests
- Deploy to production canary (1%)
- Automated canary analysis (15-30 minutes)
- Gradual traffic increase (5%, 25%, 100%)
- Monitor for 24 hours
- Decommission old version
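The key property of such a pipeline is that each stage gates the next. A toy sketch, where each stage is a hypothetical callable returning `True` on success (in practice these would invoke your CI system, deploy tooling, and canary analysis):

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure. Returns log lines."""
    log = []
    for name, step in stages:
        ok = step()
        log.append(f"{name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            log.append("pipeline halted")
            break
    return log


pipeline: List[Stage] = [
    ("build and test", lambda: True),
    ("deploy to staging", lambda: True),
    ("integration tests", lambda: True),
    ("canary 1%", lambda: True),
    ("canary analysis", lambda: False),   # simulate a failed analysis
    ("increase traffic", lambda: True),   # never reached
]
log = run_pipeline(pipeline)
print("\n".join(log))
```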
FAQ
Q: What is the difference between blue-green and canary?
Blue-green switches all traffic at once (instant, but requires double infrastructure). Canary gradually shifts traffic (slower, but limits blast radius). Blue-green is better for infrequent, high-risk deployments. Canary is better for frequent deployments where you want to catch issues early with minimal user impact.
Q: How do you handle database migrations with blue-green deployments?
Use the expand-contract pattern. Before the blue-green switch: run the migration to add new columns (backward compatible). Both blue and green work with the expanded schema. After the switch: run the migration to remove old columns (after blue is decommissioned). Never run a migration that breaks the running version. This requires multiple deployments for schema changes but ensures zero downtime.
Q: Can you use canary deployments for database changes?
Not directly. Database changes affect all versions simultaneously. Use feature flags instead: deploy the code that uses the new schema, but keep it disabled. Run the schema migration. Enable the feature flag for 1% of users (canary). Gradually increase. This gives you canary-like behavior for database changes.
Interview questions
Q1: You are deploying a new version of your API that changes the response format of a critical endpoint. How do you deploy this without downtime?
Strong answer: Use the expand-contract pattern with a canary deployment. Phase 1: deploy a version that returns both the old and new response format (backward compatible). The new field is added alongside the old field. Phase 2: canary deploy this version to 1% of traffic. Verify clients can handle both formats. Phase 3: gradually increase to 100%. Phase 4: deploy a version that removes the old field. Canary deploy again. This requires two deployments but ensures no client breaks. Alternatively, use API versioning: the new format is at /v2/endpoint. Clients migrate to v2 at their own pace. The old v1 endpoint is deprecated and eventually removed.
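To illustrate the phases, suppose (hypothetically) the endpoint used to return a single `name` field and the new format splits it into `first_name` and `last_name`. The field names and functions below are invented for the example.

```python
def render_compat(user: dict) -> dict:
    """Phase 1 response: the old `name` field plus the new split fields."""
    return {
        "id": user["id"],
        "name": f"{user['first_name']} {user['last_name']}",  # old format, kept for now
        "first_name": user["first_name"],                     # new format
        "last_name": user["last_name"],
    }


def render_final(user: dict) -> dict:
    """Phase 4 response: the old field is gone."""
    response = render_compat(user)
    del response["name"]
    return response


def old_client(response: dict) -> str:
    """A client that has not migrated yet."""
    return response["name"]


def new_client(response: dict) -> str:
    """A client that reads only the new fields."""
    return f"{response['first_name']} {response['last_name']}"


user = {"id": 7, "first_name": "Ada", "last_name": "Lovelace"}
compat = render_compat(user)
print(old_client(compat), "|", new_client(compat), "|", new_client(render_final(user)))
```

Both clients work against the phase 1 response, so the order in which servers and clients upgrade does not matter. Only `old_client` breaks on the final response, which is why phase 4 waits until clients have migrated.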
Q2: Your canary deployment shows a 0.5% increase in error rate compared to the stable version. Do you roll back?
Strong answer: It depends on statistical significance and business impact. First pin down what “0.5%” means: half a percentage point in absolute terms, or a 0.5% relative change. The increase might be noise (random variation) or a real regression, so test for statistical significance, keeping in mind that with enough traffic even a 0.1-point difference is statistically significant without necessarily mattering to users. Then compare against the baseline: if stable has a 0.1% error rate and canary has 0.6%, that is a 6x increase, which is alarming. If stable has a 5% error rate and canary has 5.5%, that is a 10% relative increase, which is less concerning. Check the error types: are they new errors or existing ones? Check business metrics: is conversion rate affected? If the increase is statistically significant and affects business metrics, roll back. If it is within normal variation, continue monitoring. Automated canary analysis tools (Kayenta, Argo Rollouts) can run this comparison automatically.
Q3: How does Kubernetes support zero-downtime deployments?
Strong answer: Kubernetes supports rolling updates natively. When you update a Deployment, Kubernetes replaces pods gradually, in batches whose size is controlled by maxSurge and maxUnavailable. New pods must pass readiness checks before they receive traffic and before more old pods are terminated. This ensures there is always a minimum number of healthy pods serving traffic. For blue-green: create a new Deployment with the new version, switch the Service's label selector from the old Deployment's pods to the new one's (instant traffic switch), then delete the old Deployment. For canary: use two Deployments (stable and canary) with different replica counts. A Service whose selector matches both sets of pods spreads traffic across them, so the canary's share is roughly its fraction of the total replicas. Argo Rollouts and Flagger extend this with automated canary analysis and traffic splitting using Istio or NGINX.
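The arithmetic behind these settings is worth being able to do on a whiteboard. The sketch below assumes percentage values: Kubernetes rounds a percentage maxSurge up and a percentage maxUnavailable down, and both default to 25%. The function names are invented, and edge cases (such as both values resolving to zero) are ignored.

```python
from typing import Tuple


def rolling_update_bounds(replicas: int,
                          max_surge_percent: int = 25,
                          max_unavailable_percent: int = 25) -> Tuple[int, int]:
    """Return (minimum available pods, maximum total pods) during a rolling update."""
    surge = -(-replicas * max_surge_percent // 100)           # ceiling division
    unavailable = replicas * max_unavailable_percent // 100   # floor division
    return replicas - unavailable, replicas + surge


def canary_share(stable_replicas: int, canary_replicas: int) -> float:
    """Approximate traffic share of the canary when one Service selects both Deployments."""
    return canary_replicas / (stable_replicas + canary_replicas)


bounds = rolling_update_bounds(10)   # 25% of 10 is 2.5: surge 3, unavailable 2
share = canary_share(9, 1)
print(bounds, share)
```

The second function also shows the limit of replica-based canaries: with 10 pods the smallest possible canary is 10% of traffic, which is why finer splits need a service mesh or ingress controller.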