Build a Blue-Green Deployment System


Switch 100% of production traffic in under 30 seconds - and roll back in under 5 seconds if anything goes wrong.

Imagine a hospital with two identical operating rooms side by side. When a procedure needs a new piece of equipment, the surgical team sets up the second room completely, runs a full checklist with dummy patients, then wheels the real patient through in one smooth handoff - with the old room standing by as a backup in case anything goes wrong. That is blue-green deployment: maintain two identical production environments, promote the new one when it’s healthy, and keep the old one hot enough to flip back to within seconds.

The challenge isn’t the concept - it’s the precision engineering underneath it. Traffic routing must be weight-based and instantaneous, not DNS-based and slow. The new environment must pass real health checks, not just “pod is running.” Existing user sessions can’t be dropped mid-flight. Databases shared by both environments must support both schema versions simultaneously. And the entire promotion sequence from 0% to 100% must complete in under 30 seconds with automatic rollback triggered the moment error rates cross a threshold.

A naive approach - stop the old version, start the new one - gives you zero downtime only in theory. In practice you get a deployment window where requests either queue (causing latency spikes) or fail (causing 503s). Even “rolling updates” in Kubernetes have a window where old and new versions coexist, with no clean boundary to roll back across. The real problem is that you need atomic traffic switching at the load balancer level, health-gated promotion, and a state machine that handles the messy middle states: what happens if green becomes unhealthy at 50% traffic? What if a database migration breaks backward compatibility? What if the health check itself is wrong?

We need to solve for sub-30-second full traffic switching, instant rollback with no traffic loss, graceful session draining, database compatibility across both environments, and a deployment state machine that survives partial failures and controller crashes. Let’s build it.

Requirements and Constraints

Functional Requirements

  • Deploy a new application version (green) alongside the current version (blue) without any downtime
  • Shift production traffic from blue to green progressively: starting at 0%, then canary (1-10%), then full (100%)
  • Complete the shift from canary to 100% in under 30 seconds once the canary observation window passes
  • Roll back to the previous environment within 5 seconds if error rates exceed a configurable threshold
  • Run automated smoke tests against the green environment before any live traffic is sent
  • Support graceful session draining - existing connections to blue complete before blue is decommissioned
  • Persist deployment state so the controller can resume if it crashes mid-deployment
  • Support multi-region deployments with per-region rollout sequencing

Non-Functional Requirements

  • Traffic switching latency: weight changes propagate to all load balancer nodes within 2 seconds
  • Rollback trigger time: from error rate breach to 0% green traffic in under 5 seconds
  • Health check frequency: every 2 seconds during active deployment, every 30 seconds in steady state
  • Availability: 99.99% for the deployment controller itself (it’s in the blast radius of every deploy)
  • Concurrent deployments: at most one active deployment per service at a time, globally enforced via distributed lock
  • Session drain timeout: configurable, default 30 seconds before forceful termination of old connections

Constraints and Assumptions

  • Services are stateless at the application tier; session state lives in Redis or a database, not in process memory
  • Database migrations must be backward-compatible (expand-contract pattern); the system doesn’t manage DB migrations
  • Blue environment is kept hot for at least 1 hour after promotion before teardown - instant rollback window
  • The system targets Kubernetes-native or ECS-based infrastructure; bare-metal deployments are out of scope
  • No blue-green across heterogeneous infrastructure (e.g., mixing Lambda and containers) in a single deployment

High-Level Architecture

The system has five major components working in concert. The Deployment Controller is the brain - a state machine that drives the entire lifecycle from provisioning to promotion or rollback. The Traffic Splitter sits at the load balancer layer (Nginx/Envoy) and applies weight-based routing rules in real time. The Health Checker continuously probes both environments and feeds signals back to the controller. The Metrics Store holds error rate, latency, and SLO data that the rollback decision algorithm reads. The Deployment State Store (etcd or Postgres) provides distributed locking and durable state so controller restarts don’t lose progress.

[Figure: Blue-green deployment system architecture overview showing the traffic router layer, blue and green environments, health checker, metrics store, and deployment state store]

Traffic enters through the DNS layer or a global load balancer that routes to a regional Traffic Splitter. The splitter holds upstream definitions for both blue and green backends, with weight values updated dynamically by the controller. During steady state, green weight is 0 - all traffic hits blue. When a deployment starts, the controller provisions green, waits for health checks to stabilize, runs smoke tests, then begins incrementing green’s weight in steps. The Health Checker sends synthetic probes every 2 seconds and the Metrics Store scrapes real traffic error rates from both upstreams. If either signal goes red during a shift, the controller flips back to 100% blue in a single atomic weight update.

Key Insight

The entire rollback mechanism is a single Nginx/Envoy upstream weight update - no pod restarts, no DNS TTL wait, no deployment rollback command. The old environment never stopped; rolling back is just changing a number from 100 to 0 on the green upstream.
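
To make the "rollback is one number" point concrete, here is a minimal sketch that models the splitter as a single green weight per service, with blue as the remainder. `InMemorySplitter` and its `weights` method are hypothetical stand-ins for whatever renders the Nginx or Envoy configuration; only the `set_weight(service_id, "green", weight)` call shape is taken from the controller code in this article.

```python
import threading


class InMemorySplitter:
    """Hypothetical stand-in for the Traffic Splitter's weight table."""

    def __init__(self):
        self._lock = threading.Lock()
        self._green = {}  # service_id -> green weight (0-100); blue is the remainder

    def set_weight(self, service_id: str, environment: str, weight: int) -> None:
        if environment not in ("blue", "green"):
            raise ValueError(f"unknown environment: {environment}")
        if not 0 <= weight <= 100:
            raise ValueError("weight must be between 0 and 100")
        green = weight if environment == "green" else 100 - weight
        with self._lock:
            # A single assignment updates both sides, so the weights always sum to 100.
            self._green[service_id] = green

    def weights(self, service_id: str) -> dict:
        with self._lock:
            green = self._green.get(service_id, 0)  # steady state: all traffic on blue
        return {"blue": 100 - green, "green": green}


splitter = InMemorySplitter()
splitter.set_weight("checkout", "green", 50)  # mid-shift
splitter.set_weight("checkout", "green", 0)   # rollback: one write, no restarts
```

Storing only the green weight and deriving blue from it is a design choice of this sketch: there is no intermediate state in which the two weights disagree.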

The Deployment Controller

The Deployment Controller is the central authority for every state transition in a deployment. Think of it as an air traffic controller: it has the full picture of where every flight (environment) is, enforces sequencing, and can divert a flight (roll back) the moment a runway issue is detected.

The controller is implemented as a state machine with eight states: IDLE, PROVISIONING, SMOKE_TESTING, CANARY, SHIFTING, ACTIVE, ROLLING_BACK, and FAILED. Each state has allowed transitions, a timeout, and a recovery action if the timeout fires. The state is persisted to the State Store before any action is taken - this is the commit-before-execute pattern that makes the controller crash-safe.
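
The article does not enumerate the allowed transitions, so the table below is an assumption inferred from the phases described here, not a definitive specification. It sketches how a controller might reject an illegal state change before committing it; `ALLOWED_TRANSITIONS` and `check_transition` are hypothetical names.

```python
# Hypothetical transition table. The edges are an assumption inferred from the
# deployment phases described in this article.
ALLOWED_TRANSITIONS = {
    "idle": {"provisioning"},
    "provisioning": {"smoke_testing", "failed"},
    "smoke_testing": {"canary", "failed"},
    "canary": {"shifting", "rolling_back"},
    "shifting": {"shifting", "active", "rolling_back"},  # one hop per weight step
    "active": {"rolling_back"},       # still possible during the rollback window
    "rolling_back": {"idle", "failed"},
    "failed": set(),                  # terminal: a new deployment starts fresh
}


def check_transition(current: str, new: str) -> None:
    """Raise ValueError if moving from `current` to `new` is not allowed."""
    if current not in ALLOWED_TRANSITIONS:
        raise ValueError(f"unknown state: {current}")
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current} -> {new}")
```

Calling `check_transition` at the top of a transition routine, before the state is saved, would keep an invalid state from ever reaching the State Store.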

[Figure: Deployment state machine showing all states from IDLE through PROVISIONING, SMOKE_TESTING, SHIFTING phases, to ACTIVE or ROLLING_BACK, with error paths]

# Deployment controller core state machine - crash-safe with pre-commit pattern
import enum
import time
import uuid
from dataclasses import dataclass
from typing import Optional

class DeployState(enum.Enum):
    IDLE = "idle"
    PROVISIONING = "provisioning"
    SMOKE_TESTING = "smoke_testing"
    CANARY = "canary"
    SHIFTING = "shifting"
    ACTIVE = "active"
    ROLLING_BACK = "rolling_back"
    FAILED = "failed"

@dataclass
class Deployment:
    deploy_id: str
    service_id: str
    blue_version: str
    green_version: str
    state: DeployState
    green_weight: int  # 0-100
    created_at: float
    updated_at: float
    error_threshold: float = 0.01   # 1% error rate triggers rollback
    rollback_reason: Optional[str] = None

class DeploymentController:
    def __init__(self, state_store, traffic_splitter, health_checker, metrics):
        self.state_store = state_store
        self.splitter = traffic_splitter
        self.health = health_checker
        self.metrics = metrics

    def transition(self, deploy: Deployment, new_state: DeployState, **kwargs) -> Deployment:
        # Commit state BEFORE taking action - crash-safe
        deploy.state = new_state
        deploy.updated_at = time.time()
        for k, v in kwargs.items():
            setattr(deploy, k, v)
        self.state_store.save(deploy)  # atomic upsert
        return deploy

    def run_deployment(self, service_id: str, green_version: str) -> Deployment:
        # Acquire distributed lock - only one active deploy per service
        lock_key = f"deploy_lock:{service_id}"
        with self.state_store.lock(lock_key, ttl=1800):
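            existing = self.state_store.get_active(service_id)
            # Finished deployments (idle, active, failed) do not block a new one
            finished = (DeployState.IDLE, DeployState.ACTIVE, DeployState.FAILED)
            if existing and existing.state not in finished:
                raise RuntimeError(f"Active deployment {existing.deploy_id} in state {existing.state}")

            # The live version is the last promoted green; after a rollback or
            # failure it is still the previous blue
            if existing is None:
                live_version = "unknown"
            elif existing.state == DeployState.ACTIVE:
                live_version = existing.green_version
            else:
                live_version = existing.blue_version

            deploy = Deployment(
                deploy_id=str(uuid.uuid4()),
                service_id=service_id,
                blue_version=live_version,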
                green_version=green_version,
                state=DeployState.IDLE,
                green_weight=0,
                created_at=time.time(),
                updated_at=time.time(),
            )

            # Phase 1: Provision green environment
            deploy = self.transition(deploy, DeployState.PROVISIONING)
            self._provision_green(deploy)
            self._wait_for_green_ready(deploy, timeout=300)

            # Phase 2: Run smoke tests at 0% live traffic
            deploy = self.transition(deploy, DeployState.SMOKE_TESTING)
            self._run_smoke_tests(deploy)

            # Phase 3: Canary - 5% live traffic, observe for 30s
            deploy = self.transition(deploy, DeployState.CANARY, green_weight=5)
            self.splitter.set_weight(service_id, "green", 5)
            self._observe_and_check(deploy, duration=30)

            # Phase 4: Progressive shift to 100%
            deploy = self.transition(deploy, DeployState.SHIFTING)
            for weight in [20, 50, 80, 100]:
                deploy = self.transition(deploy, DeployState.SHIFTING, green_weight=weight)
                self.splitter.set_weight(service_id, "green", weight)
                self._observe_and_check(deploy, duration=5)

            # Phase 5: Promote - green is now active
            deploy = self.transition(deploy, DeployState.ACTIVE)
            self._schedule_blue_teardown(deploy, delay_seconds=3600)
            return deploy

    def _observe_and_check(self, deploy: Deployment, duration: int):
        deadline = time.time() + duration
        while time.time() < deadline:
            err_rate = self.metrics.error_rate(deploy.service_id, window_seconds=60)
            if err_rate > deploy.error_threshold:
                self._rollback(deploy, reason=f"error rate {err_rate:.3%} exceeded {deploy.error_threshold:.3%}")
                raise RuntimeError("Deployment rolled back due to error rate spike")
            time.sleep(2)

    def _rollback(self, deploy: Deployment, reason: str):
        deploy = self.transition(deploy, DeployState.ROLLING_BACK, rollback_reason=reason)
        self.splitter.set_weight(deploy.service_id, "green", 0)  # instant flip
        self.transition(deploy, DeployState.IDLE)
Watch Out

Never use DNS-level switching as your primary rollback mechanism. DNS TTLs range from 30 seconds to 5 minutes and client-side caching is unpredictable. By the time DNS propagates your rollback, you’ve already served thousands of failed requests. The rollback must happen at the load balancer upstream weight layer, which propagates in under 2 seconds.
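
Because the controller commits state before acting, a crash can leave the persisted record one step ahead of the splitter. The sketch below shows one way a restarted controller might decide its first move from the last committed state. The action names and the per-state policy are assumptions for illustration; the article only says that a standby controller reads the last committed state and resumes from the current weight step.

```python
def recovery_action(state: str, green_weight: int) -> tuple:
    """Return (action, green_weight_to_apply) for a restarted controller.

    Re-applying a weight is idempotent, so recovery for any traffic-bearing
    state starts by pushing the recorded weight to the splitter again.
    """
    if state in ("idle", "failed"):
        return ("nothing", 0)
    if state in ("provisioning", "smoke_testing"):
        # No live traffic has moved yet, so restarting the phase at 0% is safe.
        return ("restart_phase", 0)
    if state in ("canary", "shifting"):
        return ("reapply_weight_and_resume", green_weight)
    if state == "rolling_back":
        # The rollback may not have reached the splitter before the crash.
        return ("reapply_weight_and_finish_rollback", 0)
    if state == "active":
        return ("reapply_weight", 100)
    raise ValueError(f"unknown state: {state}")
```

For example, a controller that died after committing `shifting` at weight 50 would re-apply 50 and continue the ladder from there, rather than starting over from the canary.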

The Traffic Splitter

The Traffic Splitter is the enforcement layer - it’s the component that actually controls which backend each request hits. Its job is deceptively simple: given a weight for each upstream, route traffic proportionally. The hard parts are atomicity (weight updates must be consistent across all load balancer replicas simultaneously), persistence (weights survive an Nginx reload), and observability (you need per-upstream metrics to catch problems early).

The canonical implementation uses Nginx’s split_clients directive or Envoy’s weighted cluster configuration. Both support live updates without dropping connections: Nginx via nginx -s reload (which does a graceful worker handoff), Envoy via its xDS management API (which applies changes without any reload at all).

# Nginx upstream config for blue-green - rendered by the controller and hot-reloaded
upstream blue_backend {
    server blue-app-1:8080 weight=100;
    server blue-app-2:8080 weight=100;
    server blue-app-3:8080 weight=100;
    keepalive 32;
}

upstream green_backend {
    server green-app-1:8080 weight=100;
    server green-app-2:8080 weight=100;
    server green-app-3:8080 weight=100;
    keepalive 32;
}

# split_clients hashes its key with MurmurHash2; $request_id is unique per request, so this splits by request, not by session
split_clients "${request_id}" $backend {
    5%      green_backend;   # updated by controller: 5% during canary
    *       blue_backend;
}

server {
    listen 80;
    location / {
        proxy_pass http://$backend;
        proxy_set_header X-Routed-To $backend;  # for debugging
        proxy_next_upstream error timeout;       # retry on another server in the same upstream group, never the other environment
    }
}
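
The config above is described as rendered by the controller. As a rough sketch of that rendering step, the function below produces the `split_clients` block for a given green weight. `render_split_clients` is a hypothetical helper; the edge cases are handled by omitting the percentage line at 0% and sending the catch-all to green at 100%, which is one reasonable choice rather than the only one.

```python
def render_split_clients(green_weight: int) -> str:
    """Render the split_clients block for a green weight of 0-100."""
    if not 0 <= green_weight <= 100:
        raise ValueError("green_weight must be between 0 and 100")
    lines = ['split_clients "${request_id}" $backend {']
    if green_weight == 100:
        # Fully promoted: the catch-all goes to green.
        lines.append("    *       green_backend;")
    else:
        if green_weight > 0:
            lines.append(f"    {green_weight}%      green_backend;")
        lines.append("    *       blue_backend;")
    lines.append("}")
    return "\n".join(lines) + "\n"


canary_block = render_split_clients(5)
```

In practice the rendered file would typically be validated with `nginx -t` before `nginx -s reload`, so a bad render never reaches the running workers.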

For Envoy-based infrastructure (service meshes such as Istio or AWS App Mesh), the equivalent is a weighted route; in Istio it is expressed as a VirtualService. The advantage of Envoy is that weight changes apply via the xDS API in milliseconds with zero connection drops - no reload signal needed.

# Istio VirtualService for blue-green traffic splitting - applied by the controller
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service
spec:
  hosts:
  - my-service    # required: the hostname(s) these routes apply to (assumed here)
  http:
  - route:
    - destination:
        host: my-service-blue
        port:
          number: 8080
      weight: 95    # controller updates this field
    - destination:
        host: my-service-green
        port:
          number: 8080
      weight: 5     # and this one
Real World

AWS CodeDeploy implements blue-green for ECS using an Application Load Balancer with two target groups. During deployment, CodeDeploy shifts the ALB listener rule weights between the blue and green target groups over a configurable duration. The rollback is a single API call that flips the listener rule back - it propagates in under 2 seconds across all ALB nodes in a region.

Health Checking and Smoke Testing

Most engineers assume a health check is a GET /healthz that returns 200. At the scale of a production deployment system, that’s necessary but nowhere near sufficient. Smoke testing is the gating mechanism that ensures green is functionally correct before any live traffic touches it.

Smoke tests are synthetic workloads that exercise the most critical paths of the application: authentication, primary write path, primary read path, and any external dependency calls (payment processor, notification service). They run against green at 0% live traffic weight, using internal routing to bypass the Traffic Splitter entirely.

# Smoke test runner - executes before any live traffic hits green
import httpx
import asyncio
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SmokeTest:
    name: str
    fn: Callable
    timeout_seconds: float = 10.0
    required: bool = True   # required=False tests are advisory only

class SmokeTestRunner:
    def __init__(self, green_base_url: str, internal_token: str):
        self.base_url = green_base_url
        self.headers = {"X-Internal-Token": internal_token, "X-Smoke-Test": "true"}

    async def run_all(self, tests: List[SmokeTest]) -> bool:
        results = await asyncio.gather(*[
            self._run_one(t) for t in tests
        ], return_exceptions=True)

        failed_required = [
            t.name for t, r in zip(tests, results)
            if t.required and (isinstance(r, Exception) or r is False)
        ]
        if failed_required:
            raise RuntimeError(f"Smoke tests failed: {failed_required}")
        return True

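    async def _run_one(self, test: SmokeTest) -> bool:
        try:
            # wait_for enforces the per-test timeout and works on Python 3.7+
            # (asyncio.timeout requires 3.11+)
            return await asyncio.wait_for(
                test.fn(self.base_url, self.headers),
                timeout=test.timeout_seconds,
            )
        except Exception:
            if test.required:
                raise  # gather(return_exceptions=True) collects this in run_all
            return False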

# Example smoke tests for a typical web service
async def test_health(base_url: str, headers: dict) -> bool:
    async with httpx.AsyncClient() as client:
        r = await client.get(f"{base_url}/healthz", headers=headers)
        return r.status_code == 200 and r.json().get("status") == "ok"

async def test_auth_flow(base_url: str, headers: dict) -> bool:
    async with httpx.AsyncClient() as client:
        # Real auth flow with a test account - not just a ping
        r = await client.post(f"{base_url}/auth/token",
                              json={"client_id": "smoke-test", "client_secret": "smoke-secret"},
                              headers=headers)
        return r.status_code == 200 and "access_token" in r.json()

async def test_write_path(base_url: str, headers: dict) -> bool:
    async with httpx.AsyncClient() as client:
        r = await client.post(f"{base_url}/api/v1/events",
                              json={"type": "smoke_test", "payload": {}},
                              headers=headers)
        return r.status_code in (200, 201)
Key Insight

Smoke tests must hit the green environment through its internal address, bypassing the Traffic Splitter entirely. If smoke tests go through the splitter at 0% green weight, they never reach green. Use a direct internal URL (e.g., http://green-service.internal:8080) injected into the smoke test runner by the controller.

Session Draining

Session draining is the grace period between “we’ve shifted 100% of new traffic to green” and “we tear down blue.” It’s analogous to a bank branch closing: new customers go to the new branch immediately, but the old branch stays open until everyone who was already inside finishes their transaction.

Without session draining, long-running requests (uploads, streaming responses, WebSocket connections) that were established on blue get terminated mid-flight when the blue pods are killed. The user sees a broken transfer, not a clean switch.

The implementation has two parts. First, the Traffic Splitter stops routing new connections to blue (weight = 0) while existing connections stay alive; in Nginx, a graceful reload lets the old worker processes finish their in-flight requests before they exit. Second, when blue is finally torn down, the Kubernetes preStop lifecycle hook and termination grace period delay pod termination so the remaining requests can finish.

# Kubernetes pod spec for graceful termination during blue teardown
apiVersion: apps/v1
kind: Deployment
metadata:
  name: blue-app
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60  # max time to drain
      containers:
      - name: app
        lifecycle:
          preStop:
            exec:
              # Pause so endpoint removal propagates and the load balancer
              # stops sending new connections. When the hook returns, the
              # kubelet sends SIGTERM and the app drains what is left.
              command: ["/bin/sh", "-c", "sleep 5"]
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          periodSeconds: 2
          failureThreshold: 3  # remove from LB after 3 failures

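The controller tracks drain completion by monitoring blue’s remaining connections, either through Envoy’s admin API or through Nginx’s stub_status module. Note that stub_status reports one server-wide connection count rather than a per-upstream one, so it is only an approximation unless that Nginx instance fronts blue alone. When active connections drop to 0, or the drain timeout fires, the controller proceeds to teardown.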

# Session drain monitor - waits until Nginx reports no remaining client connections
import time
import httpx

def wait_for_drain(nginx_status_url: str, upstream_name: str,
                   timeout: int = 30, poll_interval: float = 2.0) -> bool:
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            r = httpx.get(nginx_status_url, timeout=3)
            active = parse_active_connections(r.text, upstream_name)
            # stub_status counts the connection serving this very request,
            # so a value of 1 means nothing else is connected
            if active <= 1:
                return True
        except Exception:
            pass  # nginx may be mid-reload, or the response was unparseable
        time.sleep(poll_interval)
    return False  # timeout expired - proceed with forceful teardown

def parse_active_connections(stub_status_text: str, upstream: str) -> int:
    # stub_status only exposes a server-wide "Active connections: N" line.
    # The upstream argument is kept for interface compatibility but cannot
    # filter anything here; for a true per-upstream count, use a per-upstream
    # metric from your monitoring stack instead.
    for line in stub_status_text.splitlines():
        if line.startswith("Active connections:"):
            return int(line.split(":")[1].strip())
    # Fail loudly rather than report 0 and trigger a premature teardown
    raise ValueError("no 'Active connections' line in stub_status output")
Watch Out

Never set terminationGracePeriodSeconds shorter than your longest legitimate request duration. If your service supports file uploads and the max upload takes 45 seconds, your grace period must be at least 60 seconds. A 30-second grace period will silently kill 45-second uploads during every blue teardown.
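
The rule above reduces to simple arithmetic. As a small sketch, assuming the 5-second preStop sleep from the pod spec and a 10-second safety margin (the margin is an assumption chosen to match the 45-to-60-second example), the minimum grace period can be computed like this; `min_grace_period` is a hypothetical helper.

```python
def min_grace_period(longest_request_s: int, prestop_sleep_s: int = 5,
                     margin_s: int = 10) -> int:
    """Smallest terminationGracePeriodSeconds covering the preStop sleep,
    the longest legitimate request, and a safety margin."""
    if longest_request_s < 0 or prestop_sleep_s < 0 or margin_s < 0:
        raise ValueError("durations must be non-negative")
    return prestop_sleep_s + longest_request_s + margin_s


upload_grace = min_grace_period(45)  # 5 + 45 + 10 = 60 seconds
```

The preStop sleep counts against the grace period, which is why it is part of the sum rather than something that happens beforehand.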

Data Model

The State Store is the single source of truth for every deployment’s current status. It must support atomic transitions (compare-and-swap), distributed locking (one controller instance wins), and audit history (replay what happened during an incident).

-- Deployment state store schema (Postgres with optimistic locking)
CREATE TABLE deployments (
    deploy_id       UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    service_id      TEXT NOT NULL,
    blue_version    TEXT NOT NULL,
    green_version   TEXT NOT NULL,
    state           TEXT NOT NULL CHECK (state IN (
                        'idle','provisioning','smoke_testing',
                        'canary','shifting','active','rolling_back','failed'
                    )),
    green_weight    SMALLINT NOT NULL DEFAULT 0 CHECK (green_weight BETWEEN 0 AND 100),
    error_threshold NUMERIC(5,4) NOT NULL DEFAULT 0.01,
    rollback_reason TEXT,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    version         BIGINT NOT NULL DEFAULT 1,  -- optimistic lock version
    CONSTRAINT one_active_per_service EXCLUDE USING btree (
        service_id WITH =
    ) WHERE (state NOT IN ('idle', 'active', 'failed'))
);

CREATE UNIQUE INDEX idx_deployments_service_active
    ON deployments (service_id)
    WHERE state NOT IN ('idle', 'active', 'failed');

CREATE INDEX idx_deployments_service_state
    ON deployments (service_id, state, updated_at DESC);

-- Audit log: every state transition is recorded immutably
CREATE TABLE deployment_events (
    event_id        UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    deploy_id       UUID NOT NULL REFERENCES deployments(deploy_id),
    service_id      TEXT NOT NULL,
    from_state      TEXT,
    to_state        TEXT NOT NULL,
    green_weight    SMALLINT NOT NULL,
    controller_id   TEXT NOT NULL,  -- which controller instance made this change
    metadata        JSONB,          -- extra context: error rate, smoke test results
    occurred_at     TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_events_deploy ON deployment_events (deploy_id, occurred_at DESC);

-- Traffic weights table - source of truth for the Traffic Splitter
CREATE TABLE traffic_weights (
    service_id      TEXT NOT NULL,
    environment     TEXT NOT NULL CHECK (environment IN ('blue', 'green')),
    weight          SMALLINT NOT NULL DEFAULT 0 CHECK (weight BETWEEN 0 AND 100),
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    PRIMARY KEY (service_id, environment)
);

The optimistic lock via the version column prevents split-brain scenarios where two controller instances both believe they own a deployment. Before any state transition, the controller does:

-- Atomic state transition with optimistic lock check
UPDATE deployments
SET state = $new_state,
    green_weight = $new_weight,
    updated_at = NOW(),
    version = version + 1
WHERE deploy_id = $deploy_id
  AND version = $expected_version;
-- If 0 rows updated: another controller won the race, abort and re-read
Key Insight

The EXCLUDE USING btree constraint on the deployments table is a database-enforced guard against concurrent deployments on the same service. It’s more reliable than application-level locking because it survives controller crashes and network partitions - the database is the arbiter.
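
To see the optimistic-lock UPDATE in action without a Postgres instance, the sketch below replays it against an in-memory SQLite database. SQLite and the trimmed-down table are stand-ins chosen for the example, and `cas_transition` is a hypothetical helper name; the `WHERE ... AND version = ?` check is the same idea as the statement above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE deployments ("
    " deploy_id TEXT PRIMARY KEY, state TEXT NOT NULL,"
    " green_weight INTEGER NOT NULL, version INTEGER NOT NULL DEFAULT 1)"
)
conn.execute("INSERT INTO deployments VALUES ('d1', 'canary', 5, 1)")
conn.commit()


def cas_transition(conn, deploy_id, new_state, new_weight, expected_version) -> bool:
    """Apply the transition only if nobody else has bumped the version."""
    cur = conn.execute(
        "UPDATE deployments"
        " SET state = ?, green_weight = ?, version = version + 1"
        " WHERE deploy_id = ? AND version = ?",
        (new_state, new_weight, deploy_id, expected_version),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows means another controller won the race


# Two controllers both read version 1; only the first write succeeds.
first = cas_transition(conn, "d1", "shifting", 20, expected_version=1)
second = cas_transition(conn, "d1", "shifting", 20, expected_version=1)
```

The losing controller sees a row count of 0 and should re-read the deployment before deciding whether it still has anything to do.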

Key Algorithms and Protocols

Weighted Traffic Routing

The Traffic Splitter must distribute requests across two upstreams with configurable weights. A naive modulo approach (request_count % 100 < green_weight) has a session affinity problem: sequential requests from the same user could land on different environments within a single page load if the modulo boundary is crossed.

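The correct approach hashes a stable identifier, such as a session cookie or user ID, into a fixed bucket and compares that bucket with the green weight. This keeps a given user on the same environment while the weight is unchanged, and because the bucket never changes, a user moves from blue to green at most once as the weight rises. That stability is critical for correctness when the two versions return different API responses. A random X-Request-ID generated at the edge produces the right traffic proportions but no stickiness, because every request hashes to a new bucket.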

# Consistent hash-based traffic splitter for session-stable routing

def route_request(session_id: str, green_weight: int) -> str:
    """
    Returns 'green' or 'blue' for a given session, deterministically.
    green_weight is 0-100 (percentage of traffic to route to green).
    Uses FNV-1a hash for speed and good distribution.
    """
    if green_weight <= 0:
        return "blue"
    if green_weight >= 100:
        return "green"

    # FNV-1a 32-bit hash: fast, well-distributed, no crypto overhead
    h = 2166136261  # FNV offset basis
    for byte in session_id.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF  # FNV prime, mod 2^32

    # Map hash to [0, 100) range
    bucket = h % 100
    return "green" if bucket < green_weight else "blue"

# Time complexity: O(len(session_id)), Space: O(1)
# At green_weight=5: about 5% of the hash space maps to green (2^32 is not an exact multiple of 100)
# The same session_id always lands in the same bucket, so as green_weight rises a session moves from blue to green at most once
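
One property worth checking is that a session never bounces back to blue while the weight climbs. The sketch below re-implements the same FNV-1a bucketing on synthetic session IDs (the IDs and the `bucket` and `on_green` helpers are made up for the example) and verifies that each step's green population contains the previous one.

```python
def bucket(session_id: str) -> int:
    """FNV-1a 32-bit hash reduced to a bucket in [0, 100)."""
    h = 2166136261  # FNV offset basis
    for byte in session_id.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF  # FNV prime, mod 2^32
    return h % 100


sessions = [f"session-{i}" for i in range(10_000)]  # synthetic IDs
buckets = {s: bucket(s) for s in sessions}


def on_green(weight: int) -> set:
    """Sessions routed to green at the given weight."""
    return {s for s, b in buckets.items() if b < weight}


ladder = [5, 20, 50, 80, 100]
green_sets = [on_green(w) for w in ladder]

# Each step's green set is a superset of the previous one: nobody flips back.
monotonic = all(a <= b for a, b in zip(green_sets, green_sets[1:]))
share_at_5 = len(green_sets[0]) / len(sessions)  # should be close to 0.05
```

The monotonic behavior follows directly from comparing a fixed bucket against a rising threshold, so it holds for any hash function; the hash only affects how evenly sessions spread across buckets.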

Automatic Rollback Decision Algorithm

The rollback algorithm must distinguish a real error rate spike from statistical noise. A 1% error rate threshold that triggers immediately on the first bad second will fire on every brief transient. We use a sliding window with a minimum sample size requirement before the decision fires.

# Rollback decision algorithm - sliding window with minimum sample gate
import collections
import time
from typing import Deque

class ErrorRateMonitor:
    def __init__(self, window_seconds: int = 60, min_requests: int = 100,
                 threshold: float = 0.01):
        self.window = window_seconds
        self.min_requests = min_requests
        self.threshold = threshold
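        # Sliding window: deque of (timestamp, is_error) tuples, oldest first
        self._samples: Deque = collections.deque()

    def _evict(self, now: float):
        # Drop samples that have aged out of the window
        cutoff = now - self.window
        while self._samples and self._samples[0][0] < cutoff:
            self._samples.popleft()

    def record(self, is_error: bool):
        now = time.time()
        self._samples.append((now, is_error))
        self._evict(now)

    def should_rollback(self) -> tuple[bool, str]:
        # Evict here as well, so stale samples cannot drive a decision after a quiet period
        self._evict(time.time())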
        if len(self._samples) < self.min_requests:
            # Not enough data to make a decision - don't rollback on noise
            return False, f"insufficient samples ({len(self._samples)}/{self.min_requests})"

        total = len(self._samples)
        errors = sum(1 for _, is_error in self._samples if is_error)
        error_rate = errors / total

        if error_rate > self.threshold:
            return True, f"error rate {error_rate:.3%} > threshold {self.threshold:.3%} over {total} requests"
        return False, f"error rate {error_rate:.3%} within threshold"

    def reset(self):
        self._samples.clear()
Key Insight

The min_requests gate is what prevents false rollbacks during the first seconds of a canary deployment when a single bad request would otherwise be 100% error rate. Wait for at least 100 samples before acting - at 1000 RPS on a 5% canary, that’s about 2 seconds of data.
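
The arithmetic behind that estimate generalizes to any traffic level. As a quick sketch (`seconds_to_min_samples` is a hypothetical helper, and it assumes the monitor counts only requests routed to green):

```python
def seconds_to_min_samples(total_rps: float, green_weight: int,
                           min_requests: int = 100) -> float:
    """Seconds of traffic needed before the monitor has enough green samples."""
    green_rps = total_rps * green_weight / 100
    if green_rps <= 0:
        raise ValueError("green receives no traffic; the gate can never open")
    return min_requests / green_rps


busy_service = seconds_to_min_samples(1000, 5)   # 100 / 50 rps = 2.0 seconds
quiet_service = seconds_to_min_samples(20, 5)    # 100 / 1 rps = 100.0 seconds
```

The second case shows the limit of the gate: a service at 20 RPS would need 100 seconds of 5% canary traffic to collect 100 samples, longer than a 30-second canary window. Low-traffic services would likely need a smaller `min_requests`, a larger canary slice, or a longer observation period.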

Canary Promotion Ladder

The progression from 0% to 100% isn’t a single jump. It’s a ladder: 5% for 30 seconds, then 20%, 50%, and 80% with a 5-second observation window each, and finally 100% with a 10-second verification. This gives the rollback algorithm enough samples at each weight to detect problems early before they affect the majority of traffic.

# Canary promotion ladder configuration
PROMOTION_LADDER = [
    {"weight": 5,   "observe_seconds": 30},  # canary: small slice, long watch
    {"weight": 20,  "observe_seconds": 5},
    {"weight": 50,  "observe_seconds": 5},
    {"weight": 80,  "observe_seconds": 5},
    {"weight": 100, "observe_seconds": 10},   # final: full traffic, short verify
]
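
A ladder expressed as data can be walked by a small generic loop. The following is a minimal sketch under stated assumptions: `run_ladder`, its callback parameters, and the 2-second poll interval are hypothetical, and the injected `sleep` exists only so the loop can be exercised without waiting.

```python
import time

PROMOTION_LADDER = [
    {"weight": 5, "observe_seconds": 30},
    {"weight": 20, "observe_seconds": 5},
    {"weight": 50, "observe_seconds": 5},
    {"weight": 80, "observe_seconds": 5},
    {"weight": 100, "observe_seconds": 10},
]


def run_ladder(ladder, set_weight, is_healthy, sleep=time.sleep, poll_seconds=2):
    """Walk the ladder; on the first unhealthy check, set green to 0 and stop."""
    for step in ladder:
        set_weight(step["weight"])
        remaining = step["observe_seconds"]
        while remaining > 0:
            if not is_healthy():
                set_weight(0)  # rollback is a single weight write
                return False
            pause = min(poll_seconds, remaining)
            sleep(pause)
            remaining -= pause
    return True


# Time budget implied by the ladder configuration.
total_seconds = sum(step["observe_seconds"] for step in PROMOTION_LADDER)
after_canary_seconds = total_seconds - PROMOTION_LADDER[0]["observe_seconds"]
```

With this configuration the observation windows add up to 55 seconds in total, 25 of which come after the canary step, so the post-canary shift fits inside a 30-second budget only if weight propagation itself stays fast.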

Scaling and Performance

The deployment controller is a control plane component - it processes one deployment at a time per service, so it doesn’t need to handle high request throughput. The scaling challenges are different: parallelism across services (100 services deploying simultaneously), multi-region fan-out (propagating weight changes to 5 regions), and health check storm (polling 1000 pods every 2 seconds during active deployments).

[Figure: Multi-region blue-green scaling diagram showing phased rollout across US-EAST and EU-WEST regions with Global Deploy Controller orchestration]

Back-of-envelope capacity estimation:

Given:
  - 50 active services in blue-green deployment simultaneously
  - Each service has 20 pods in green + 20 pods in blue = 40 pods
  - Health check every 2 seconds per pod during active deployment
  - 5 regions

Health check QPS:
  50 services * 40 pods * (1 / 2s) * 5 regions = 5,000 health checks/second

State store write rate (weight updates):
  50 services * 6 weight steps * 1 per deployment = 300 writes per deploy cycle
  Assuming deploys staggered over 10 min: ~0.5 writes/second to state store

Nginx config reload rate:
  50 services * 6 config regenerations = 300 reloads per deploy cycle
  Each reload: ~100ms graceful worker handoff, no connection drops

Traffic Splitter API call rate (Envoy xDS):
  300 weight updates * 5 regions * 3 xDS control plane replicas = 4,500 xDS pushes
  xDS push latency: < 50ms per region

Storage (audit log):
  300 state transitions * 1 KB per event = 300 KB per deploy cycle
  At 10 deploy cycles/hour * 24h * 365d: ~26 GB/year (trivially small)
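
The same estimate can be kept as a small script so the numbers are easy to re-run with different inputs. All values are the ones given above; the variable names are just labels for this sketch.

```python
# Inputs from the estimate above
services = 50
pods_per_service = 40          # 20 green + 20 blue
check_interval_s = 2
regions = 5
weight_steps = 6
stagger_window_s = 10 * 60     # deploys staggered over 10 minutes
xds_replicas = 3
event_size_kb = 1
cycles_per_year = 10 * 24 * 365

health_qps = services * pods_per_service / check_interval_s * regions
writes_per_cycle = services * weight_steps
writes_per_second = writes_per_cycle / stagger_window_s
xds_pushes = writes_per_cycle * regions * xds_replicas
audit_gb_per_year = writes_per_cycle * event_size_kb * cycles_per_year / 1_000_000

print(f"health checks/s:   {health_qps:,.0f}")
print(f"state writes/s:    {writes_per_second}")
print(f"xDS pushes/cycle:  {xds_pushes:,}")
print(f"audit log GB/year: {audit_gb_per_year:.2f}")
```

Doubling the pod count or halving the check interval each double the health check rate, which is why that line dominates the rest.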

The health check load is the dominant concern. At 5,000 health checks per second, a single health check service becomes a bottleneck. The solution is sharding health checkers by service: assign each service to a dedicated health check worker, co-located in the same region as the pods it monitors. This reduces cross-region health check traffic and lets each worker scale independently.
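
One simple way to shard health checkers by service is a stable hash of the service ID over the workers in the pod's own region. The sketch below assumes a static worker list per region; `assign_worker` and the worker names are hypothetical.

```python
import hashlib


def assign_worker(service_id: str, region: str, workers_by_region: dict) -> str:
    """Pick the health check worker for a service, within the pods' own region."""
    workers = workers_by_region[region]
    if not workers:
        raise ValueError(f"no health check workers in region {region}")
    # sha256 gives an assignment that is stable across processes and restarts
    # (Python's built-in hash() is randomized per process for strings).
    digest = hashlib.sha256(service_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(workers)
    return workers[index]


workers = {
    "us-east": ["hc-a", "hc-b", "hc-c"],
    "eu-west": ["hc-x", "hc-y"],
}
owner = assign_worker("checkout", "us-east", workers)
```

Plain modulo assignment reshuffles most services when the worker count changes; if workers are added or removed often, a consistent-hashing ring would limit that churn.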

Real World

Netflix’s Spinnaker deployment platform handles multi-region blue-green deployments by maintaining region-local deployment agents that receive commands from a global orchestrator. Each regional agent makes its own health check decisions and reports back - this avoids a central health check bottleneck and makes the system resilient to inter-region network partitions during a deployment.
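
Per-region rollout sequencing can be sketched as a loop over regions that stops at the first failure. The policy shown here, rolling back the failed region and then every already-promoted region in reverse order, is an assumption for illustration; a real system might instead halt and leave healthy regions on green. `rollout_regions` and its callbacks are hypothetical.

```python
def rollout_regions(regions, deploy_region, rollback_region):
    """Deploy region by region; on failure, unwind in reverse order."""
    done = []
    for region in regions:
        if deploy_region(region):
            done.append(region)
            continue
        # Roll back the failed region first, then unwind completed ones.
        rolled_back = [region] + done[::-1]
        for r in rolled_back:
            rollback_region(r)
        return {"succeeded": False, "failed_region": region,
                "rolled_back": rolled_back}
    return {"succeeded": True, "failed_region": None, "rolled_back": []}


rollback_log = []
result = rollout_regions(
    ["us-east", "eu-west"],
    deploy_region=lambda region: region != "eu-west",  # simulate a failure in eu-west
    rollback_region=rollback_log.append,
)
```

Deploying to one region first also turns that region into a large-scale canary for the rest.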

Failure Modes and Recovery

| Failure | Detection | Impact | Recovery |
|---|---|---|---|
| Controller crashes mid-shift (e.g., at 50% green) | No heartbeat in state store for 30s | Traffic stays at 50/50 split indefinitely | Standby controller acquires lock, reads last committed state, resumes from current weight step |
| Green pod crashes during canary | Health check 503 for 3 consecutive polls | Canary traffic hits errors until detection (6s max) | Controller calls rollback: weight=0 on green within 2s; error budget hit triggers alert |
| Database migration breaks backward compatibility | 5xx errors from blue pods (reading new schema) | All blue traffic starts failing | Roll green weight back to 0 immediately; the root cause is the migration (expand-contract pattern violation) |
| Nginx reload drops connections | Monitoring shows connection reset spike | Brief connection interruption for in-flight requests | Use upstream keepalive connections and test reload under load in staging; Envoy xDS avoids this entirely |
| State store unavailable (Postgres down) | Controller cannot write state transition | In-flight deployment pauses; no further shifts | Controller retries with exponential backoff; traffic weight stays at last committed value, which is safe |
| Smoke test false failure (test bug) | Smoke runner returns failure on healthy green | Deployment blocked indefinitely | Alert on smoke test failure with test name + response body; allow manual override with audit log entry |
Watch Out

The most common operational mistake is keeping the blue environment’s minimum pod count too low after promotion. If blue scales down to 0 replicas immediately after green is promoted, rollback takes 2-3 minutes to spin blue back up instead of 5 seconds. Keep blue at full replica count for at least 1 hour post-promotion - the infrastructure cost is one hour of double capacity, which is almost always worth the rollback speed.
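
One way to make the hold period hard to forget is to encode it in whatever computes blue's target replica count. A minimal sketch, assuming a hypothetical `blue_target_replicas` helper and the one-hour hold suggested above; how the result is applied to the cluster is deliberately left out:

```python
from datetime import datetime, timedelta, timezone

# Hold window suggested in the text; tune per service.
BLUE_HOLD = timedelta(hours=1)

def blue_target_replicas(full_replicas: int,
                         promoted_at: datetime,
                         now: datetime,
                         hold: timedelta = BLUE_HOLD) -> int:
    """Return how many replicas blue should run after green is promoted.

    Blue stays at full capacity during the hold window so that rollback
    remains a weight flip; only afterwards may it scale to zero.
    """
    if now - promoted_at < hold:
        return full_replicas
    return 0

promoted = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(blue_target_replicas(20, promoted, promoted + timedelta(minutes=10)))
print(blue_target_replicas(20, promoted, promoted + timedelta(hours=2)))
```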

Comparison of Approaches

| Approach | Traffic Shift Speed | Rollback Speed | Resource Cost | Best Fit |
| --- | --- | --- | --- | --- |
| Blue-green (this design) | Full shift in 30s | Instant (weight flip, 2s) | 2x capacity during deploy | High-availability services with strict rollback SLA |
| Rolling update (Kubernetes default) | Gradual, per-pod | Slow (new rollout required, 2-5 min) | 1x capacity | Stateless services tolerating brief mixed-version traffic |
| Canary-only (no full switch) | Stays partial indefinitely | Fast (reduce canary %) | 1.1x capacity | Risk-averse teams, ML model rollouts where gradual exposure is desired |
| Feature flags (no redeploy) | Instant (config change) | Instant | 1x capacity | Behavioral changes; requires flag infrastructure and clean separation of new code |
| Shadow deployment (traffic mirroring) | N/A (read-only) | N/A | 2x capacity | Validation of new version correctness before any traffic shift; no user impact |
| A/B testing | Partial, user-segmented | Partial (segment rollback) | 2x capacity | Feature experiments where you want a permanent % split for measurement |

Blue-green is the right choice when your rollback window is measured in seconds, not minutes. The 2x capacity cost is the price of that guarantee. For services where 2 minutes of rollback is acceptable, a Kubernetes rolling update with maxUnavailable=0 achieves similar zero-downtime properties at half the infrastructure cost. For teams building on service meshes (Istio, Linkerd), canary-only deployments with traffic shifting are often simpler to operate than full blue-green because the mesh handles routing natively - though rollback is never as clean as a binary weight flip.

Key Takeaways

  • Traffic switching must happen at the load balancer layer - Nginx upstream weights or Envoy xDS changes propagate in under 2 seconds, while DNS TTLs take 30 seconds to 5 minutes.
  • The rollback mechanism is always running - the old environment never stops, which means rollback is a weight update, not a deployment operation.
  • Smoke tests must bypass the Traffic Splitter - green carries 0% live weight before promotion, so tests routed through the splitter would never reach it; hit green directly via its internal URL instead.
  • Session draining prevents mid-flight request drops - keep blue alive and accepting connections until in-flight requests complete before scaling it down.
  • The state machine must commit before acting - write the target state to the store before executing the action, so a crash between commit and execution is always recoverable on restart.
  • Minimum sample gating prevents rollback on noise - require at least 100 requests in the observation window before the error rate threshold triggers a rollback.
  • Database backward compatibility is a prerequisite, not a feature - blue-green requires that the new schema is readable by the old code; use expand-contract migrations, never destructive schema changes during an active deployment.
  • Keep blue hot for 1 hour post-promotion - the 2x capacity cost buys instant rollback; scaling blue down immediately converts rollback from 5 seconds to 3 minutes.
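
The minimum sample gate from the list above fits in a few lines. This is a sketch under stated assumptions: the 100-request floor comes from the text, while the 1% error threshold, the `Window` type, and the `should_rollback` name are hypothetical choices made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Window:
    """Counts observed on green during one observation window."""
    requests: int
    errors: int

def should_rollback(window: Window,
                    min_requests: int = 100,       # floor from the text
                    max_error_rate: float = 0.01   # assumed threshold
                    ) -> bool:
    # Too few samples: a handful of errors is indistinguishable from noise,
    # so the gate refuses to decide either way.
    if window.requests < min_requests:
        return False
    return window.errors / window.requests > max_error_rate

print(should_rollback(Window(requests=10, errors=5)))    # below the sample floor
print(should_rollback(Window(requests=200, errors=5)))   # 2.5% error rate
```

Note that the gate trades detection speed for stability: a release that fails every request still serves errors until the floor is reached, which is why the floor should be sized against expected canary traffic.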

The counter-intuitive lesson from blue-green systems is that the hardest part isn’t the traffic shift - it’s the database layer. The moment you have two versions of your application running against the same database, every schema change must be backward-compatible. Teams that skip this constraint eventually discover it during an emergency rollback, when the old code can’t read the new schema and the rollback itself causes a data layer outage. Blue-green’s most dangerous failure mode is not in the deployment system at all - it’s in the migration strategy.

Frequently Asked Questions

Q: Why not just use Kubernetes rolling updates instead of building blue-green? A: Rolling updates provide zero-downtime but not instant rollback. During a rolling update, old and new pods coexist, and rolling back requires a new rolling update - which takes 2-5 minutes. Blue-green maintains two complete environments so rollback is a single weight flip taking 2 seconds. The tradeoff is 2x capacity cost during the deployment window.

Q: How do you handle database migrations when both blue and green are running simultaneously? A: Enforce the expand-contract (parallel change) pattern: migrations must be applied in three separate deployments. First deployment adds the new column (expand phase, blue reads old column, green can use either). Second deployment migrates data and switches code to new column. Third deployment removes the old column (contract phase, after blue is fully decommissioned). Never make a schema change that breaks the running version.
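
To make the expand phase concrete, here is a small sketch using Python's built-in sqlite3 module as a stand-in database. The `users` table, the `name` and `display_name` columns, and the `blue_read`, `green_write`, and `green_read` helpers are all hypothetical; the point is only that old code keeps working after the new column appears.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (name) VALUES ('Ada Lovelace')")

def blue_read(conn, user_id):
    # Old code: only knows about the old column.
    row = conn.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()
    return row[0]

# Deployment 1 (expand): add a nullable column. Nothing blue relies on changes.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

def green_write(conn, name):
    # New code writes both columns while blue may still be serving traffic.
    cur = conn.execute(
        "INSERT INTO users (name, display_name) VALUES (?, ?)", (name, name))
    return cur.lastrowid

def green_read(conn, user_id):
    # Fall back to the old column for rows that are not backfilled yet.
    row = conn.execute(
        "SELECT COALESCE(display_name, name) FROM users WHERE id = ?",
        (user_id,)).fetchone()
    return row[0]

new_id = green_write(conn, "Grace Hopper")

# Deployment 2: backfill existing rows so new code can rely on the new column.
conn.execute("UPDATE users SET display_name = name WHERE display_name IS NULL")

# Deployment 3 (contract) would drop the old column, and only after blue is
# fully decommissioned. It is intentionally not executed here.
```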

Q: What happens if the controller crashes at exactly 50% green weight? A: The deployment pauses at 50/50 indefinitely - traffic continues to split, neither environment is torn down. A standby controller instance detects the missing heartbeat in the state store (after 30 seconds), acquires the lock, reads the last committed state (50% green), and continues the promotion ladder from where it left off. This is why the state must be committed before the weight change is applied.
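
The commit-then-act ordering described in this answer can be sketched with plain dictionaries standing in for the state store and the load balancer. The `Controller` class, its method names, and the `crash_before_apply` switch are hypothetical and exist only to simulate a crash between the two steps.

```python
class Controller:
    def __init__(self, store: dict, splitter: dict):
        self.store = store        # stand-in for the durable state store
        self.splitter = splitter  # stand-in for the load balancer weights

    def shift(self, green_weight: int, crash_before_apply: bool = False) -> None:
        # 1. Commit the intended weight first.
        self.store["target_green_weight"] = green_weight
        if crash_before_apply:
            raise RuntimeError("simulated crash between commit and action")
        # 2. Apply it. Setting an absolute weight is idempotent, so replaying
        #    this step after a crash is safe.
        self.splitter["green_weight"] = green_weight

    def recover(self) -> int:
        # On takeover: read the last committed target and reconcile drift.
        target = self.store.get("target_green_weight", 0)
        if self.splitter.get("green_weight") != target:
            self.splitter["green_weight"] = target
        return target

store, splitter = {}, {"green_weight": 25}
active = Controller(store, splitter)
try:
    active.shift(50, crash_before_apply=True)
except RuntimeError:
    pass  # the active controller is gone; the splitter still says 25

standby = Controller(store, splitter)
resumed_at = standby.recover()
print(resumed_at, splitter["green_weight"])
```

Had the order been reversed (act, then commit), a crash in between would leave the splitter at a weight the store knows nothing about, and the standby would have no committed record to resume from.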

Q: Why use consistent hashing per session instead of random per-request routing? A: Random per-request routing means a single user’s page load might hit blue for the API call that fetches their data and green for the API call that mutates it. If blue and green have different data models or API response shapes, this can corrupt user state. Consistent hashing ensures a given session always hits the same environment during the deployment window.
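
A minimal sketch of session-sticky routing, assuming a hypothetical `route` function and an integer green weight in percent. It uses a simple hash-bucket scheme rather than a full consistent-hash ring, which is enough to show the two properties that matter here: a session always lands in the same environment at a given weight, and raising green's weight only ever moves sessions from blue to green.

```python
import hashlib

def route(session_id: str, green_weight: int) -> str:
    """Pick an environment for a session given green's weight (0-100)."""
    if not 0 <= green_weight <= 100:
        raise ValueError("green_weight must be between 0 and 100")
    # A stable hash (unlike Python's built-in hash(), which is salted per
    # process) so every router instance agrees on the bucket.
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    # Buckets 0..green_weight-1 go to green; the set only grows as the
    # weight is raised, so a session never flips back to blue mid-promotion.
    return "green" if bucket < green_weight else "blue"

for weight in (0, 5, 25, 50, 100):
    print(weight, route("session-42", weight))
```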

Q: Can blue-green work for services with very long-running connections like WebSockets or gRPC streaming? A: Yes, but session draining becomes critical and the drain timeout must be set to the maximum expected connection lifetime. For WebSocket connections that can last hours, you may need a separate notification mechanism (close frame with reconnect code) that tells clients to reconnect, which will land on green. The alternative is to route streaming connections separately and never shift them during a deployment.

Q: Why not use feature flags for all deployments instead of blue-green? A: Feature flags require that every behavioral change is wrapped in a flag check in the code. This works well for feature releases but fails for infrastructure changes (dependency upgrades, framework migrations, security patches) where the change is pervasive and can’t be wrapped in a conditional. Blue-green works regardless of what changed - the new binary is either healthy or it isn’t.

Interview Questions

Q: Walk me through how you’d implement instant rollback in a blue-green system. Expected depth: Explain that rollback is a single upstream weight update at the load balancer (an Envoy xDS push that sets green's weight to 0, or in Nginx marking green's servers as down, since stock Nginx rejects weight=0 on an upstream server). Discuss propagation latency (under 2s for xDS, under 5s for Nginx reload). Explain why DNS-level rollback is unacceptable (TTL latency, client-side caching). Mention that blue must be kept at full replica count for the rollback to be instant - if blue has scaled down, rollback triggers a pod spin-up that takes 2-3 minutes.

Q: How would you handle a database schema change as part of a blue-green deployment? Expected depth: Describe the expand-contract pattern in detail: three-phase migration across three separate deployments. Explain why a single-deployment schema change that drops a column breaks the old (blue) version. Discuss the timing: the expand migration runs before the green deployment starts, so both blue and green can read the new schema during the traffic shift. The contract (cleanup) migration runs after blue is decommissioned.

Q: Your rollback algorithm has a minimum sample requirement of 100 requests before it will trigger. What are the edge cases? Expected depth: Discuss low-traffic services: at 100 RPS with a 5% canary, green sees only 5 RPS, so collecting 100 samples takes 20 seconds, during which a bad release keeps serving errors. For very low-traffic services, you may need to lower min_requests or extend the canary observation window. Also discuss the startup burst problem: when green first receives traffic after a cold start, the first few requests see high latency (JVM warmup, cold caches), which can surface as timeouts and inflate the error rate. Add a warm-up period (30s at 0% live traffic after smoke tests) before starting the canary.
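
The low-traffic arithmetic in this answer generalizes to a one-line formula: time to reach the sample floor is min_requests divided by the request rate green actually sees. A small sketch, with a hypothetical function name and the canary share expressed as an integer percent:

```python
def seconds_to_min_samples(service_rps: float,
                           canary_percent: int,
                           min_requests: int = 100) -> float:
    """Seconds until green has served min_requests at the given canary share."""
    green_rps = service_rps * canary_percent / 100
    if green_rps <= 0:
        raise ValueError("green receives no traffic; the gate can never open")
    return min_requests / green_rps

print(seconds_to_min_samples(100, 5))   # the example from the text
print(seconds_to_min_samples(2, 5))     # a very low-traffic service
```

For the second case the wait is long enough that lowering min_requests or lengthening the observation window, as suggested above, becomes a practical necessity.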

Q: How would you design the health checker to avoid a thundering herd during a simultaneous deploy of 100 services? Expected depth: Shard health checkers by service - each service gets a dedicated health check worker, not a shared pool. Stagger deployment start times even when triggering multiple deploys simultaneously (add jitter). Use a pull-based model where pods expose Prometheus metrics and the health checker scrapes them on a schedule, rather than a push model where every pod reports to a central endpoint. Discuss circuit breaking on the health checker itself to avoid cascading failures when a large service has all pods unhealthy.
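
Staggering with jitter, as mentioned in this answer, can be as simple as drawing a random start offset per service. A minimal sketch, assuming a hypothetical `staggered_starts` helper; the 600-second spread in the usage lines is an arbitrary illustration, not a recommendation from the text:

```python
import random

def staggered_starts(services, max_jitter_s, seed=None):
    """Map each service to a random start delay in [0, max_jitter_s] seconds."""
    rng = random.Random(seed)  # a seed makes the schedule reproducible in tests
    return {svc: rng.uniform(0, max_jitter_s) for svc in services}

delays = staggered_starts([f"svc-{i}" for i in range(100)], max_jitter_s=600, seed=7)
for svc in sorted(delays, key=delays.get)[:3]:
    print(svc, round(delays[svc], 1))
```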

Q: How do you ensure the deployment controller itself doesn’t become a single point of failure? Expected depth: Run multiple controller replicas in active-standby mode using distributed locks in the state store (Postgres advisory locks or etcd leases). The active controller holds the lock and renews it via heartbeat. If the heartbeat stops, a standby acquires the lock within 30 seconds and resumes. Discuss the recovery procedure: read last committed state, verify current infrastructure state matches (pod counts, upstream weights), reconcile any drift before continuing. This is the same pattern Kubernetes controller-manager uses.
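
The active-standby handoff in this answer boils down to a lease with a time-to-live that the holder must keep renewing. The sketch below is an in-memory stand-in, not a Postgres advisory lock or an etcd lease; the `LeaseLock` class and its `try_acquire` method are hypothetical, and the clock is passed in explicitly so the behavior is easy to follow.

```python
class LeaseLock:
    """Single-holder lease: it expires if not renewed within ttl_s seconds."""

    def __init__(self, ttl_s: float = 30.0):
        self.ttl_s = ttl_s
        self.holder = None
        self.renewed_at = None

    def try_acquire(self, candidate: str, now: float) -> bool:
        expired = self.holder is None or now - self.renewed_at >= self.ttl_s
        if expired or self.holder == candidate:
            # Either a takeover of a free or expired lease, or a heartbeat
            # renewal by the current holder.
            self.holder = candidate
            self.renewed_at = now
            return True
        return False

lock = LeaseLock(ttl_s=30.0)
print(lock.try_acquire("controller-a", now=0))    # a becomes active
print(lock.try_acquire("controller-b", now=10))   # b is refused, a is alive
print(lock.try_acquire("controller-a", now=20))   # a's heartbeat renews
print(lock.try_acquire("controller-b", now=50))   # a went silent for 30s
```

After taking over, the new holder would still need to read the last committed state and reconcile it against the real upstream weights before continuing, as the answer describes.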
