Blue-Green and Canary Deploys: Shipping Without Downtime

You deploy a new version of your API. The deployment takes 5 minutes. During those 5 minutes, some servers run the old version and some run the new version. A request hits a server running the new version. It calls another service that expects the old API format. It fails. Your users see errors.

This is the deployment problem. Blue-green and canary deployments solve it in different ways: blue-green by switching all traffic at once, canary by gradually shifting traffic while monitoring for problems.

Blue-green deployment

Blue-green deployment maintains two identical production environments: blue (current) and green (new). Traffic goes to blue. You deploy the new version to green. You test green. You switch all traffic from blue to green. If something goes wrong, you switch back to blue.

How it works:

Blue environment runs the current version (serving all traffic)
Deploy new version to green environment (no traffic)
Run smoke tests on green
Switch traffic from blue to green (DNS change, load balancer update)
Monitor green for issues
If issues: switch back to blue (instant rollback)
If healthy: decommission blue (or keep as next deployment target)

graph TB
subgraph before["Before Deployment"]
  LB1["Load Balancer<br/>100% traffic"] --> BLUE1["Blue<br/>v1.0<br/>All traffic"]
  GREEN1["Green<br/>v2.0<br/>No traffic"]
end

subgraph switch["Traffic Switch"]
  LB2["Load Balancer"] -->|"switch"| GREEN2["Green<br/>v2.0<br/>All traffic"]
  BLUE2["Blue<br/>v1.0<br/>Standby<br/>(instant rollback)"]
end

style BLUE1 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style GREEN1 fill:#F1EFE8,stroke:#888780,color:#444441
style GREEN2 fill:#E1F5EE,stroke:#0F6E56,color:#085041
style BLUE2 fill:#F1EFE8,stroke:#888780,color:#444441
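
The blue-green steps above can be sketched in a few lines of Python. This is a toy model, not a real load balancer integration: `BlueGreenRouter`, `release`, and the two callback parameters are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class BlueGreenRouter:
    """Toy stand-in for a load balancer that sends all traffic to one environment."""
    versions: Dict[str, str] = field(
        default_factory=lambda: {"blue": "v1.0", "green": ""}
    )
    live: str = "blue"

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def deploy_to_idle(self, version: str) -> None:
        # The idle environment receives no traffic, so deploying here is safe.
        self.versions[self.idle] = version

    def switch(self) -> None:
        # A single pointer flip; the previous environment stays up as standby.
        self.live = self.idle

    def handle(self, request: str) -> str:
        return f"{self.versions[self.live]} handled {request}"


def release(router: BlueGreenRouter, version: str,
            smoke_test: Callable[[str], bool],
            healthy_after_switch: Callable[[], bool]) -> str:
    router.deploy_to_idle(version)
    if not smoke_test(router.versions[router.idle]):
        return "aborted before switch"
    router.switch()
    if not healthy_after_switch():
        router.switch()  # rollback is the same flip in reverse
        return "rolled back"
    return "released"
```

The point of the sketch is that release and rollback are the same operation: both only change which environment the router points at, which is why the old environment has to stay running until the new one has proven itself.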

Pros:

Zero downtime (a load balancer switch is near-instant; a DNS switch only takes effect as cached records expire)
Instant rollback (switch back to blue)
Full testing before traffic switch
No mixed versions serving traffic simultaneously

Cons:

Requires double the infrastructure (two full environments)
Database migrations must be backward compatible (both versions use the same database)
Stateful sessions must be handled (users lose sessions held in blue's memory when switched to green)

Canary deployment

Canary deployment gradually shifts traffic from the old version to the new version. Start with 1% of traffic on the new version. Monitor for errors. If healthy, increase to 5%, then 25%, then 100%.

The name comes from the “canary in a coal mine”: a small group of users acts as the early warning system.

How it works:

Deploy new version alongside old version
Route 1% of traffic to new version
Monitor error rate, latency, and business metrics
If healthy: increase to 5%, 25%, 50%, 100%
If issues: route 0% to new version (instant rollback)

graph LR
subgraph canary["Canary Deployment Progression"]
  S1["Stage 1<br/>1% canary<br/>99% stable"]
  S2["Stage 2<br/>5% canary<br/>95% stable"]
  S3["Stage 3<br/>25% canary<br/>75% stable"]
  S4["Stage 4<br/>100% canary<br/>0% stable"]
  S1 -->|"healthy"| S2
  S2 -->|"healthy"| S3
  S3 -->|"healthy"| S4
  S2 -->|"issues detected"| ROLL["Rollback<br/>0% canary"]
end

style S1 fill:#FAEEDA,stroke:#854F0B,color:#633806
style S2 fill:#FAEEDA,stroke:#854F0B,color:#633806
style S3 fill:#E1F5EE,stroke:#0F6E56,color:#085041
style S4 fill:#E1F5EE,stroke:#0F6E56,color:#085041
style ROLL fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
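
A minimal sketch of the canary progression above, assuming traffic is split per request by a weight. `STAGES`, `next_weight`, and `route` are hypothetical names; a real setup would push the weight into a load balancer or service mesh rather than decide in application code.

```python
import random

# Percent of traffic on the canary at each stage, taken from the steps above.
STAGES = [1, 5, 25, 50, 100]


def next_weight(current: int, healthy: bool) -> int:
    """Advance one stage if the canary is healthy, otherwise drop to 0."""
    if not healthy:
        return 0  # rollback: no traffic reaches the new version
    later = [stage for stage in STAGES if stage > current]
    return later[0] if later else 100


def route(rng: random.Random, canary_weight: int) -> str:
    """Send roughly canary_weight percent of requests to the canary."""
    return "canary" if rng.random() < canary_weight / 100 else "stable"
```

For example, `next_weight(5, healthy=True)` returns 25 and `next_weight(25, healthy=False)` returns 0. Routing is per request here, so one user can see both versions across requests, which is exactly why API compatibility between versions matters.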

Pros:

Limits blast radius (a bad deploy only affects the canary's share of users, 1% at the first stage)
Real production traffic validates the new version
Gradual rollout gives time to detect subtle issues
No double infrastructure cost (canary runs alongside stable)

Cons:

Mixed versions serving traffic simultaneously (API compatibility required)
Slower rollout (minutes to hours vs seconds for blue-green)
More complex traffic routing
Harder to test database migrations (both versions use the same database)

Rolling deployment

A simpler variant: replace instances one at a time. Take one instance out of rotation, deploy the new version, bring it back. Repeat for all instances.

Pros: Simple, no extra infrastructure.

Cons: Mixed versions during deployment, no instant rollback, slower than blue-green.
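
As a rough illustration, a rolling deployment is a loop over instances. The sketch below models the fleet as a list of version strings; `rolling_deploy` and `health_check` are hypothetical names, and the health check is a placeholder for whatever readiness signal your platform provides.

```python
from typing import Callable, List


def rolling_deploy(instances: List[str], new_version: str,
                   health_check: Callable[[int, str], bool]) -> List[str]:
    """Replace versions one instance at a time; stop at the first unhealthy one.

    `instances` holds the version running on each instance. Returns the fleet
    state after the rollout, which is mixed if a health check failed.
    """
    fleet = list(instances)
    for i in range(len(fleet)):
        previous = fleet[i]
        # Instance i is out of rotation here; the others keep serving traffic.
        fleet[i] = new_version
        if not health_check(i, new_version):
            fleet[i] = previous  # restore this instance and halt the rollout
            break
    return fleet
```

If the check fails on the second of three instances, the result is `["v2", "v1", "v1"]`: the first instance is already on the new version, so rolling back means running the loop again in reverse rather than flipping a switch.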

Where it breaks or gets interesting

Database migrations

Database migrations are the hardest part of zero-downtime deployments. Both the old and new versions must work with the same database schema during the transition.

The expand-contract pattern:

Expand: add new columns/tables (backward compatible, old version ignores them)
Deploy new version (both versions work with the expanded schema)
Contract: remove old columns/tables (after old version is fully replaced)

Never deploy a schema change that breaks the old version while the old version is still running.
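
Here is a small sketch of expand-contract using an in-memory SQLite database. The `users` table and the `full_name` and `display_name` columns are invented for the example. It also adds a backfill step between expand and contract, which is an assumption about how you would move the existing data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT NOT NULL)")


def old_version_insert(name: str) -> None:
    # v1 code only knows about full_name and names its columns explicitly.
    conn.execute("INSERT INTO users (full_name) VALUES (?)", (name,))


def new_version_insert(name: str) -> None:
    # v2 code writes both columns so v1 can still read rows created by v2.
    conn.execute(
        "INSERT INTO users (full_name, display_name) VALUES (?, ?)", (name, name)
    )


def new_version_read() -> list:
    # v2 falls back to full_name for rows written by v1 after the expand step.
    rows = conn.execute(
        "SELECT COALESCE(display_name, full_name) FROM users ORDER BY id"
    )
    return [row[0] for row in rows]


old_version_insert("Ada")

# Expand: a nullable column does not affect v1's explicit INSERT and SELECT.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

old_version_insert("Grace")  # v1 still works against the expanded schema
new_version_insert("Linus")  # v2 works too
names_during_transition = new_version_read()

# Backfill rows written by v1, then contract once v1 is fully replaced.
conn.execute("UPDATE users SET display_name = full_name WHERE display_name IS NULL")
conn.executescript("""
    CREATE TABLE users_v2 (id INTEGER PRIMARY KEY, display_name TEXT NOT NULL);
    INSERT INTO users_v2 (id, display_name) SELECT id, display_name FROM users;
    DROP TABLE users;
    ALTER TABLE users_v2 RENAME TO users;
""")
columns_after_contract = [row[1] for row in conn.execute("PRAGMA table_info(users)")]
```

The contract step rebuilds the table instead of dropping the column in place only so the sketch runs on older SQLite builds. What matters is the ordering: the old column disappears only after nothing reads or writes it.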

Session handling in blue-green

When you switch from blue to green, users with active sessions on blue lose their sessions. Solutions:

Use a shared session store (Redis) that both blue and green can access
Use stateless tokens (JWTs) that do not require a session store
Accept brief session loss for non-critical applications
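
To illustrate the stateless-token option, the sketch below signs session claims with a secret that both environments share. It is a simplified signed token, not a real JWT (no header, no expiry), and `SHARED_SECRET`, `issue_token`, and `verify_token` are hypothetical names.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional

# Hypothetical secret; in practice both environments load it from a secret store.
SHARED_SECRET = b"demo-secret-shared-by-blue-and-green"


def issue_token(claims: dict, secret: bytes = SHARED_SECRET) -> str:
    """Encode the claims and append an HMAC-SHA256 signature."""
    payload = base64.urlsafe_b64encode(
        json.dumps(claims, sort_keys=True).encode("utf-8")
    )
    signature = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return payload.decode("ascii") + "." + signature


def verify_token(token: str, secret: bytes = SHARED_SECRET) -> Optional[dict]:
    """Return the claims if the signature matches, otherwise None."""
    payload, _, signature = token.rpartition(".")
    expected = hmac.new(secret, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))
```

A token issued by blue verifies on green because verification needs only the token and the shared secret, not any state held by the server that issued it.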

Feature flags as an alternative

Feature flags let you deploy code without activating it. Deploy the new feature to all servers, but keep it disabled. Enable it for 1% of users (canary), then gradually increase. Rollback is instant (disable the flag). No infrastructure changes needed.

Feature flags are more flexible than canary deployments but require more application code changes.
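
A percentage rollout behind a flag can be sketched as below, assuming users are bucketed by a hash of the flag name and user ID. `FeatureFlag` is a hypothetical class; hosted flag services and libraries offer the same idea with far more targeting options.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class FeatureFlag:
    name: str
    rollout_percent: int = 0  # 0 disables the feature for everyone

    def is_enabled(self, user_id: str) -> bool:
        # Hash flag name and user together so each flag buckets users
        # independently, and a given user always gets the same answer.
        key = f"{self.name}:{user_id}".encode("utf-8")
        bucket = int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % 100
        return bucket < self.rollout_percent
```

Raising `rollout_percent` from 1 to 5 keeps every user who already had the feature and adds new ones, because a user's bucket never changes. Setting it back to 0 is the rollback.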

Canary analysis automation

Manual canary analysis (watching dashboards) is error-prone. Automate it: compare error rate, latency, and business metrics between canary and stable. If the canary is statistically worse, automatically roll back. Tools: Spinnaker (automated canary analysis), Argo Rollouts (Kubernetes), Flagger (Kubernetes).
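
One simple way to decide whether the canary is "statistically worse" is a one-sided two-proportion z-test on error counts. The sketch below is an illustration under that assumption, not necessarily the method any of the tools named above use; `canary_is_worse` and the default threshold are hypothetical choices.

```python
import math


def canary_is_worse(stable_errors: int, stable_total: int,
                    canary_errors: int, canary_total: int,
                    z_threshold: float = 2.33) -> bool:
    """One-sided two-proportion z-test: is the canary's error rate higher?

    The default threshold of 2.33 corresponds to roughly a 1% significance level.
    """
    if stable_total == 0 or canary_total == 0:
        return False  # no traffic on one side: nothing to compare yet
    p_stable = stable_errors / stable_total
    p_canary = canary_errors / canary_total
    pooled = (stable_errors + canary_errors) / (stable_total + canary_total)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / stable_total + 1 / canary_total))
    if std_err == 0:
        return False  # identical all-success or all-failure samples
    z = (p_canary - p_stable) / std_err
    return z > z_threshold
```

With 100 errors in 100,000 stable requests and 6 errors in 1,000 canary requests the test flags the canary; with 2 errors in 1,000 canary requests it does not, because a gap that small is consistent with noise at that sample size.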

Real-world systems

Netflix - Relies heavily on canary deployments for production changes. Automated canary analysis compares metrics between canary and baseline. Spinnaker orchestrates the deployment pipeline.

Facebook - Uses a staged rollout: internal employees first, then 1% of users, then 10%, then 100%. Each stage has automated checks.

Google - Uses canary deployments with automated analysis. Changes are rolled out gradually across their global infrastructure.

AWS - CodeDeploy supports blue-green and canary deployments. ECS and Lambda support traffic shifting.

Kubernetes - Supports rolling updates natively. Argo Rollouts and Flagger add blue-green and canary support.

How to apply it in practice

Choosing a strategy

Use blue-green when:

You need instant rollback
You can afford double infrastructure
Your deployment is infrequent (weekly, monthly)
You need to test the full environment before switching

Use canary when:

You deploy frequently (multiple times per day)
You want to limit blast radius
You have good monitoring to detect issues
Your infrastructure cannot support double capacity

Use rolling when:

Simplicity is more important than instant rollback
You have many instances and can tolerate brief mixed versions

Monitoring during deployment

During a canary deployment, monitor:

Error rate (canary vs stable)
Latency (canary vs stable)
Business metrics (conversion rate, order completion rate)
Infrastructure metrics (CPU, memory)

Set automatic rollback triggers: if canary error rate is 2x the stable error rate, automatically roll back.
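
The 2x rule can be sketched as a small function. The minimum request count and the baseline floor are assumptions added here so that a handful of requests, or a stable error rate of zero, does not trigger a rollback by accident; `should_roll_back` and its parameter names are hypothetical.

```python
def should_roll_back(stable_errors: int, stable_requests: int,
                     canary_errors: int, canary_requests: int,
                     max_ratio: float = 2.0,
                     min_requests: int = 500,
                     min_baseline_rate: float = 0.001) -> bool:
    """Roll back when the canary error rate reaches max_ratio times stable."""
    if canary_requests < min_requests or stable_requests == 0:
        return False  # not enough data to judge
    stable_rate = stable_errors / stable_requests
    canary_rate = canary_errors / canary_requests
    # Floor the baseline so one canary error against a perfect stable
    # version does not count as an infinite ratio.
    baseline = max(stable_rate, min_baseline_rate)
    return canary_rate >= max_ratio * baseline
```

For example, a stable error rate of 0.1% (10 in 10,000) and a canary rate of 0.3% (3 in 1,000) triggers a rollback, while a canary rate of 0.1% does not.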

The deployment pipeline

A typical deployment pipeline:

Build and test (CI)
Deploy to staging (full environment)
Run integration tests
Deploy to production canary (1%)
Automated canary analysis (15-30 minutes)
Gradual traffic increase (5%, 25%, 100%)
Monitor for 24 hours
Decommission old version

FAQ

Q: What is the difference between blue-green and canary?

Blue-green switches all traffic at once (instant, but requires double infrastructure). Canary gradually shifts traffic (slower, but limits blast radius). Blue-green is better for infrequent, high-risk deployments. Canary is better for frequent deployments where you want to catch issues early with minimal user impact.

Q: How do you handle database migrations with blue-green deployments?

Use the expand-contract pattern. Before the blue-green switch: run the migration to add new columns (backward compatible). Both blue and green work with the expanded schema. After the switch: run the migration to remove old columns (after blue is decommissioned). Never run a migration that breaks the running version. This requires multiple deployments for schema changes but ensures zero downtime.

Q: Can you use canary deployments for database changes?

Not directly. Database changes affect all versions simultaneously. Use feature flags instead: deploy the code that uses the new schema, but keep it disabled. Run the schema migration. Enable the feature flag for 1% of users (canary). Gradually increase. This gives you canary-like behavior for database changes.

Interview questions

Q1: You are deploying a new version of your API that changes the response format of a critical endpoint. How do you deploy this without downtime?

Strong answer: Use the expand-contract pattern with a canary deployment. Phase 1: deploy a version that returns both the old and new response format (backward compatible). The new field is added alongside the old field. Phase 2: canary deploy this version to 1% of traffic. Verify clients can handle both formats. Phase 3: gradually increase to 100%. Phase 4: once clients have moved to the new field, deploy a version that removes the old field, again as a canary. This requires two deployments but ensures no client breaks. Alternatively, use API versioning: the new format is at /v2/endpoint. Clients migrate to v2 at their own pace. The old v1 endpoint is deprecated and eventually removed.
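
A sketch of what the phases mean for the response body, using an invented endpoint that splits a single `name` field into `first_name` and `last_name`. All function and field names here are hypothetical.

```python
from typing import Dict


def render_v1(first: str, last: str) -> Dict[str, str]:
    # Original format: one combined field.
    return {"name": f"{first} {last}"}


def render_transition(first: str, last: str) -> Dict[str, str]:
    # Phases 1 to 3: old field kept, new fields added alongside it.
    return {"name": f"{first} {last}", "first_name": first, "last_name": last}


def render_v2(first: str, last: str) -> Dict[str, str]:
    # Phase 4: old field removed once no client reads it.
    return {"first_name": first, "last_name": last}


def old_client_display(response: Dict[str, str]) -> str:
    return response["name"]


def new_client_display(response: Dict[str, str]) -> str:
    return f'{response["first_name"]} {response["last_name"]}'
```

Both clients work against `render_transition`, while `old_client_display(render_v2(...))` raises `KeyError`. That is the breakage the two-step rollout avoids.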

Q2: Your canary deployment shows a 0.5% increase in error rate compared to the stable version. Do you roll back?

Strong answer: It depends on statistical significance and business impact. A 0.5 percentage-point increase might be noise (random variation) or a real regression. Test for statistical significance: with little traffic the gap may be noise, while with enough traffic even a 0.1-point difference is statistically significant. Check the increase relative to the baseline: if stable has a 0.1% error rate and canary has 0.6%, that is a 6x increase, which is significant. If stable has a 5% error rate and canary has 5.5%, that is a 10% relative increase, which is less concerning. Check the error types: are they new errors or existing ones? Check business metrics: is conversion rate affected? If the increase is statistically significant and affects business metrics, roll back. If it is within normal variation, continue monitoring. Automated canary analysis tools (Kayenta, Argo Rollouts) handle this statistical analysis automatically.

Q3: How does Kubernetes support zero-downtime deployments?

Strong answer: Kubernetes supports rolling updates natively. When you update a Deployment, Kubernetes replaces pods incrementally, in batches sized by maxSurge and maxUnavailable (both default to 25%). A new pod only receives traffic once it passes its readiness probe, and old pods are removed only as far as maxUnavailable allows, so a minimum number of ready pods is always serving traffic. For blue-green: create a new Deployment with the new version, switch the Service selector from the old Deployment's labels to the new one's (a near-instant traffic switch), then delete the old Deployment once you no longer need it for rollback. For canary: run two Deployments (stable and canary) with different replica counts behind one Service whose selector matches both. The Service spreads traffic across all matching pods, so the split roughly follows the replica ratio (1 canary pod out of 10 gives about 10%). Argo Rollouts and Flagger extend this with automated canary analysis and finer-grained traffic splitting using Istio or nginx.
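
To make maxSurge and maxUnavailable concrete, the sketch below computes the pod-count bounds a rolling update stays within. It follows the documented rounding rule (percentages round up for maxSurge and down for maxUnavailable). `rolling_update_bounds` is a hypothetical helper, not part of any Kubernetes client library.

```python
from typing import Tuple, Union


def _resolve(value: Union[int, str], replicas: int, round_up: bool) -> int:
    """Turn an absolute count or a percentage string like '25%' into a pod count."""
    if isinstance(value, int):
        return value
    scaled = int(value.rstrip("%")) * replicas  # integer math avoids float error
    return -(-scaled // 100) if round_up else scaled // 100


def rolling_update_bounds(replicas: int,
                          max_surge: Union[int, str] = "25%",
                          max_unavailable: Union[int, str] = "25%") -> Tuple[int, int]:
    """Return (minimum available pods, maximum total pods) during the rollout."""
    surge = _resolve(max_surge, replicas, round_up=True)
    unavailable = _resolve(max_unavailable, replicas, round_up=False)
    return replicas - unavailable, replicas + surge
```

With 10 replicas and the defaults this gives `(8, 13)`: at least 8 pods stay available and at most 13 exist at once. Setting `max_surge=1, max_unavailable=0` gives `(10, 11)`, the conservative configuration that never dips below full capacity.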