The Region That Went Dark

System Design Scenario

When your entire stack lives in one AWS region and that region decides to take the night off

⏱ 12 min read · 📐 Intermediate · 🔒 Cloud Infrastructure

It’s 2:14 AM on a Tuesday. Your phone buzzes with the particular vibration pattern that means PagerDuty, not a text from a friend. You fumble it off the nightstand. CRITICAL: All health checks failing. Error rate: 100%. Revenue impact: $47K/min. You open your laptop and try to hit the dashboard. Nothing loads. You try the API directly. Connection timeout. You check Twitter - it’s already trending: “AWS us-east-1 outage.”

Your entire company - every service, every database, every queue, every byte of customer data - lives in a single AWS region. And that region just went dark.

This isn’t a hypothetical. On November 25, 2020, AWS us-east-1 experienced a major outage that took down a significant portion of the internet. Slack, Roku, Adobe, Flickr, Coinbase - all offline. On December 7, 2021, it happened again. The teams that recovered in minutes had one thing in common: they weren’t depending on a single region. The teams that were down for hours had something else in common: they’d been meaning to fix that.

You open your infrastructure diagram. It’s a single box labeled “us-east-1” with everything inside it. Load balancer, app servers, RDS database, ElastiCache, SQS queues, S3 buckets. The architecture that was “good enough for now” six months ago is now costing the company $47,000 every minute you sit here in the dark.

This is the multi-region availability problem - and the gap between knowing you need it and actually implementing it is where outages live.

Why This Happens

The default path on every cloud provider leads to single-region deployments. When you spin up an EC2 instance, it goes in one region. When you create an RDS database, it lives in one region. When you follow the “getting started” tutorial, everything lands in us-east-1 because it’s the default, it has the most services available, and it’s where all the examples point.

Single-region architectures aren’t a mistake - they’re a gravitational inevitability. Every additional region doubles your infrastructure cost, adds data consistency complexity, and introduces failure modes that didn’t exist before. Teams avoid multi-region because the engineering cost is real and the outage probability feels theoretical. Until it isn’t.

The failure chain is predictable:

AWS region experiences control plane issue
  → EC2 instances become unreachable
    → ALB health checks fail, no healthy targets
      → DNS still points to dead load balancer
        → All requests time out (no alternative target)
          → 100% of users experience outage
            → Recovery depends entirely on AWS fixing the issue
              → You have zero control over your RTO

The core issue isn’t that regions fail - it’s that a single-region architecture has zero redundancy at the highest level of the infrastructure hierarchy. You might have three Availability Zones within that region, giving you redundancy against a single data center failure. But AZs share control planes, networking fabric, and regional services. When the region goes, the AZs go with it.

Core Insight

Availability Zones protect you from localized hardware failures. They do not protect you from regional control plane outages, regional networking issues, or regional service degradations. Multi-AZ is not multi-region.

The Naive Solution

The first thing most teams reach for after a regional outage is active-passive failover - a cold standby region that sits idle until needed. The reasoning is straightforward: keep a copy of everything in us-west-2, and if us-east-1 goes down, flip traffic over.

[Figure: single-region architecture showing complete failure when us-east-1 goes down]

The problem with cold standby is that it’s cold. Your secondary region hasn’t served real traffic in months. The database replica exists, but has it been tested? Are the ECS task definitions up to date? Did someone remember to deploy last Thursday’s config change to the standby region? The AMIs might be three versions behind. The IAM roles might reference resources that no longer exist.

Here’s where active-passive breaks down at scale:

Failover initiated at t=0
  → DNS TTL propagation: 60-300 seconds (depending on client caching)
    → DB promotion from read replica to primary: 30-60 seconds
      → Cold containers need to scale from 0 to handle full traffic: 3-8 minutes
        → Cache is completely empty (cold start): 15+ minutes of DB pressure
          → Configuration drift discovered: manual intervention needed
            → Actual RTO: 15-45 minutes (not the 5 minutes in the runbook)

The scale breakpoint is confidence. At low complexity - a stateless API with a single database - active-passive works fine. Once you have stateful services, event queues with in-flight messages, cache dependencies, and cross-service communication, the “just flip it over” fantasy collapses under the weight of untested assumptions.

Warning

Active-passive failover has a dirty secret: the failover itself is the riskiest operation you’ll perform. If you’ve never tested it under real load, your first test will be during the actual outage. Untested failover plans are sometimes called “recovery theater.”

The Better Solution

The answer is active-active multi-region - both regions serve real production traffic all the time. No cold standby. No untested failover paths. Every region is battle-hardened by continuous real traffic.

[Figure: multi-region active-active architecture with automatic failover via Route 53]

This requires solving three distinct layers: global traffic routing, stateless service replication, and data synchronization.

Layer 1: Global Load Balancing

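Traffic must route to the nearest healthy region automatically, without human intervention. Route 53 offers two routing policies for this: failover routing, which sends traffic to a secondary region only when the primary is unhealthy, and latency-based routing, which keeps both regions serving. Failover routing looks like this: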

# Terraform - Route 53 failover routing
resource "aws_route53_health_check" "primary" {
  fqdn              = "api-east.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 10

  tags = {
    Name = "primary-region-health-check"
  }
}

resource "aws_route53_record" "api" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"

  alias {
    name                   = aws_lb.primary.dns_name
    zone_id                = aws_lb.primary.zone_id
    evaluate_target_health = true
  }

  failover_routing_policy {
    type = "PRIMARY"
  }

  health_check_id = aws_route53_health_check.primary.id
  set_identifier  = "primary"
}

resource "aws_route53_record" "api_secondary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"

  alias {
    name                   = aws_lb.secondary.dns_name
    zone_id                = aws_lb.secondary.zone_id
    evaluate_target_health = true
  }

  failover_routing_policy {
    type = "SECONDARY"
  }

  set_identifier = "secondary"
}

For true active-active (both regions handling traffic simultaneously), use latency-based routing instead:

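# Latency-based routing - traffic goes to nearest healthy region.
# Only the us-east-1 record is shown; a matching "west" record for us-west-2
# and the aws_route53_health_check.east / .west resources are assumed to exist.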
resource "aws_route53_record" "api_east" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"

  alias {
    name                   = aws_lb.east.dns_name
    zone_id                = aws_lb.east.zone_id
    evaluate_target_health = true
  }

  latency_routing_policy {
    region = "us-east-1"
  }

  health_check_id = aws_route53_health_check.east.id
  set_identifier  = "east"
}

Real-World

Netflix uses a combination of Route 53 and their own Zuul gateway to route traffic across three AWS regions (us-east-1, us-west-2, eu-west-1). They have publicly described evacuating a failing region by shifting its traffic to the remaining two, which only works because all three regions serve production traffic every day.

Layer 2: Stateless Service Replication

Your application servers need to run identically in both regions. This is the easiest layer if your services are truly stateless:

# ECS Service definition - deployed to both regions via CI/CD
# aws-ecs-service.yml
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  ApiService:
    Type: AWS::ECS::Service
    Properties:
      Cluster: !Ref ECSCluster
      DesiredCount: 4
      TaskDefinition: !Ref ApiTaskDefinition
      LoadBalancers:
        - ContainerName: api
          ContainerPort: 8080
          TargetGroupArn: !Ref ApiTargetGroup
      DeploymentConfiguration:
        MinimumHealthyPercent: 75
        MaximumPercent: 200
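
  # Auto-scaling to absorb failover traffic. Scaling limits are not a property
  # of AWS::ECS::Service; they belong on a separate Application Auto Scaling target.
  ApiServiceScalingTarget:
    Type: AWS::ApplicationAutoScaling::ScalableTarget
    Properties:
      ServiceNamespace: ecs
      ScalableDimension: ecs:service:DesiredCount
      ResourceId: !Sub service/${ECSCluster}/${ApiService.Name}
      MinCapacity: 4
      MaxCapacity: 20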

The key detail: each region must have enough spare capacity to absorb the other region’s traffic during failover. If both regions run at 50% capacity during normal operation, either can handle 100% during a failure. This is the “N+1 at the region level” principle.
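The 50% figure is the two-region case of a more general rule. A minimal sketch of the arithmetic, assuming traffic is spread evenly and every region is sized identically (the function names here are illustrative, not from any AWS tooling):

```python
def max_safe_utilization(regions: int, tolerated_failures: int = 1) -> float:
    """Highest steady-state utilization per region that still lets the
    surviving regions absorb the traffic of the failed ones.

    Assumes evenly distributed traffic and identically sized regions.
    """
    if regions < 2:
        raise ValueError("need at least two regions to survive a regional failure")
    if not 0 < tolerated_failures < regions:
        raise ValueError("tolerated_failures must be between 1 and regions - 1")
    survivors = regions - tolerated_failures
    return survivors / regions


def required_capacity_per_region(peak_rps: float, regions: int,
                                 tolerated_failures: int = 1) -> float:
    """Requests per second each region must be able to serve on its own
    so that the survivors can carry the full peak load."""
    if not 0 < tolerated_failures < regions:
        raise ValueError("tolerated_failures must be between 1 and regions - 1")
    return peak_rps / (regions - tolerated_failures)


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(f"{n} regions: run each at or below {max_safe_utilization(n):.0%}")
```

With two regions the ceiling is 50%; with three it rises to about 67%, which is one reason a third region can lower the cost of headroom even though it adds a deployment target.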

Layer 3: Data Replication

This is where multi-region gets hard. Stateless services are easy to replicate. Data is not.

RTO (Recovery Time Objective) is how long you can be down. RPO (Recovery Point Objective) is how much data you can afford to lose. These two numbers determine your replication strategy.

# Aurora Global Database - async replication, sub-second lag
resource "aws_rds_global_cluster" "main" {
  global_cluster_identifier = "example-global-cluster"
  engine                    = "aurora-postgresql"
  engine_version            = "14.5"
  database_name             = "app_production"
}

resource "aws_rds_cluster" "primary" {
  cluster_identifier        = "example-cluster-primary"
  engine                    = "aurora-postgresql"
  engine_version            = "14.5"
  global_cluster_identifier = aws_rds_global_cluster.main.id
  master_username           = "app_admin" # "admin" is a reserved word for PostgreSQL engines
  master_password           = var.db_password
  availability_zones        = ["us-east-1a", "us-east-1b", "us-east-1c"]
}

resource "aws_rds_cluster" "secondary" {
  provider                  = aws.west
  cluster_identifier        = "example-cluster-secondary"
  engine                    = "aurora-postgresql"
  engine_version            = "14.5"
  global_cluster_identifier = aws_rds_global_cluster.main.id
  availability_zones        = ["us-west-2a", "us-west-2b", "us-west-2c"]

  # This cluster starts as a reader - promoted on failover
  depends_on = [aws_rds_cluster.primary]
}

Data replication lag is the gap between what’s written in the primary region and what’s available in the secondary. Aurora Global Database typically maintains under 1 second of lag, meaning your RPO is roughly 1 second of transactions during an unplanned failover.

Real-World

DynamoDB Global Tables provide active-active replication with eventual consistency. Writes in any region typically propagate to the other regions within about a second. A common use is storing idempotency keys, so that a payment processed in us-east-1 is recognized and not processed again in us-west-2 after failover. Because replication is asynchronous, a key written in the last second before a failure may not have arrived yet.

The Full Architecture

[Figure: full multi-region active-active architecture with all components]

The happy path works like this: a user in New York hits api.example.com. Route 53 resolves this to the us-east-1 ALB based on latency routing. The request hits an ECS container, which reads from the local Aurora reader and ElastiCache instance. Writes go to the Aurora primary writer in us-east-1. Aurora replicates those writes to us-west-2 asynchronously.

A user in San Francisco hits the same domain. Route 53 routes them to us-west-2. Their reads are served locally. Their writes still go to the Aurora primary in us-east-1 (unless you’re using DynamoDB Global Tables, which accept writes in any region).

When us-east-1 fails: Route 53 health checks detect the failure within 30 seconds. DNS updates route all traffic to us-west-2. The Aurora secondary cluster promotes to primary (roughly 30 seconds). West coast users see no interruption in reads, but their writes fail until the promotion completes, because those writes were going to the us-east-1 primary. East coast users experience 30-90 seconds of errors during DNS propagation, then resume normally.
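The 30-90 second figure follows from the health check settings and the DNS TTL. A rough sketch of that arithmetic, assuming the health check values from the Terraform example above (10-second interval, threshold of 3) and a 60-second TTL; it is a simplified model that ignores resolvers that cache longer than the TTL allows:

```python
from typing import Tuple


def failover_window_seconds(request_interval: int,
                            failure_threshold: int,
                            dns_ttl: int) -> Tuple[int, int]:
    """Return (best_case, worst_case) seconds of errors a client sees.

    Detection takes failure_threshold consecutive failed checks. After
    that, a client keeps using its cached DNS answer until the TTL runs out.
    """
    detection = request_interval * failure_threshold
    best_case = detection             # cached answer expires just as DNS flips
    worst_case = detection + dns_ttl  # answer was cached just before DNS flipped
    return best_case, worst_case


if __name__ == "__main__":
    best, worst = failover_window_seconds(request_interval=10,
                                          failure_threshold=3,
                                          dns_ttl=60)
    print(f"clients see errors for {best}-{worst} seconds")
```

Lowering the TTL shrinks only the second term; the detection time sets the floor, which is why the health check interval and threshold deserve as much attention as the DNS settings.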

Component Deep Dives

Failover Automation

Manual failover at 2 AM is a disaster recipe. The automation needs to handle detection, decision, and execution without human input:

# Lambda function triggered by CloudWatch alarm
import boto3
import time

def handler(event, context):
    """Automated failover orchestrator."""
    route53 = boto3.client('route53')
    rds = boto3.client('rds', region_name='us-west-2')

    # Step 1: Verify the outage is real (avoid flapping)
    if not verify_sustained_failure(event):
        return {"status": "transient", "action": "none"}

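    # Step 2: Promote Aurora secondary to primary.
    # AllowDataLoss=True requests an unplanned failover; without it the call is a
    # planned switchover, which needs a healthy primary region.
    rds.failover_global_cluster(
        GlobalClusterIdentifier='example-global-cluster',
        TargetDbClusterIdentifier='arn:aws:rds:us-west-2:123456789012:cluster:example-cluster-secondary',
        AllowDataLoss=True
    )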

    # Step 3: Wait for promotion to complete
    waiter = rds.get_waiter('db_cluster_available')
    waiter.wait(DBClusterIdentifier='example-cluster-secondary')

    # Step 4: Update Route 53 if not using automatic failover
    # (If using failover routing policy, this happens automatically)

    # Step 5: Scale up secondary region
    ecs = boto3.client('ecs', region_name='us-west-2')
    ecs.update_service(
        cluster='production',
        service='api-service',
        desiredCount=20  # absorb full traffic
    )

    return {"status": "failover_complete", "timestamp": time.time()}


def verify_sustained_failure(event):
    """Require the 3 most recent health check data points to all be failures."""
    # Route 53 publishes health check metrics only to CloudWatch in us-east-1.
    cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')
    now = time.time()
    response = cloudwatch.get_metric_data(
        MetricDataQueries=[{
            'Id': 'health',
            'MetricStat': {
                'Metric': {
                    'Namespace': 'AWS/Route53',
                    'MetricName': 'HealthCheckStatus',
                    'Dimensions': [{'Name': 'HealthCheckId', 'Value': event['health_check_id']}]
                },
                # HealthCheckStatus is reported at one-minute granularity, so a
                # 10-second period over a 60-second window yields too few points.
                'Period': 60,
                'Stat': 'Minimum'
            }
        }],
        StartTime=now - 300,
        EndTime=now,
        ScanBy='TimestampDescending'  # newest data points first
    )
    recent = response['MetricDataResults'][0]['Values'][:3]
    return len(recent) == 3 and all(v == 0 for v in recent)

Cross-Region Cache Warming

An empty cache in the failover region means your database gets hammered with the full request volume immediately after failover. Pre-warming solves this:

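# Cache warming service running in secondary region
import json
import time

import boto3
import redis

class CacheWarmer:
    def __init__(self, primary_region='us-east-1', secondary_region='us-west-2'):
        # The invalidation stream lives in the primary region; the cache is local.
        self.kinesis = boto3.client('kinesis', region_name=primary_region)
        self.secondary_region = secondary_region
        # Placeholder endpoint; substitute the real ElastiCache hostname.
        self.local_cache = redis.Redis(host='elasticache.us-west-2.amazonaws.com')

    def process_replication_stream(self, poll_interval=1.0):
        """Consume cache invalidation events and pre-warm locally.

        Reads a single shard for brevity; a real consumer would iterate
        over every shard in the stream.
        """
        shard_iterator = self.kinesis.get_shard_iterator(
            StreamName='cache-invalidation-stream',
            ShardId='shardId-000000000000',
            ShardIteratorType='LATEST'
        )['ShardIterator']

        # NextShardIterator is missing once the shard has been closed.
        while shard_iterator:
            response = self.kinesis.get_records(
                ShardIterator=shard_iterator,
                Limit=100
            )
            for record in response['Records']:
                event = json.loads(record['Data'])
                self.local_cache.set(
                    event['key'],
                    event['value'],
                    ex=event['ttl']
                )
            shard_iterator = response.get('NextShardIterator')
            # Stay under the per-shard read limit (5 GetRecords calls per second).
            time.sleep(poll_interval)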

Health Check Endpoint

The health check isn’t just “is the process running” - it verifies the full dependency chain:

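// Health check endpoint that validates full stack.
// checkDatabase, checkRedis, and checkSQS are application-specific helpers (not shown).
import (
    "context"
    "encoding/json"
    "net/http"
    "time"
)
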
func healthHandler(w http.ResponseWriter, r *http.Request) {
    ctx, cancel := context.WithTimeout(r.Context(), 5*time.Second)
    defer cancel()

    checks := map[string]error{
        "database":  checkDatabase(ctx),
        "cache":     checkRedis(ctx),
        "queue":     checkSQS(ctx),
    }

    allHealthy := true
    results := make(map[string]string)
    for name, err := range checks {
        if err != nil {
            allHealthy = false
            results[name] = err.Error()
        } else {
            results[name] = "ok"
        }
    }

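    // Headers must be set before WriteHeader is called.
    w.Header().Set("Content-Type", "application/json")
    if !allHealthy {
        w.WriteHeader(http.StatusServiceUnavailable)
    }
    json.NewEncoder(w).Encode(results)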
}

Design Insight

Your health check endpoint determines your failover sensitivity. Too shallow (just return 200) and you’ll route traffic to a region that’s technically running but can’t serve requests. Too deep (check every downstream dependency) and a single slow query triggers unnecessary failovers.

Comparison Table

Approach	RTO	RPO	Cost Multiplier	Complexity	Best For
Single Region (Multi-AZ)	Depends on AWS	0 (synchronous within region)	1x	Low	Non-critical apps, early startups
Active-Passive (Cold)	15-45 min	1-5 min (async repl)	1.3x	Medium	Moderate SLAs, cost-sensitive
Active-Passive (Warm)	5-15 min	Under 1 min	1.6x	Medium-High	Business-critical with budget constraints
Active-Active (Read local, Write primary)	30-90 sec	Under 1 sec	2x	High	High-availability SaaS, fintech
Active-Active (Multi-writer)	Under 10 sec	~0 (conflict resolution)	2.2x+	Very High	Global apps, gaming, real-time systems
Cell-Based Architecture	Under 5 sec (per cell)	0 (isolated cells)	2.5x+	Extreme	Hyperscale (AWS, Azure internal)

Key Takeaways

Multi-region is insurance, not optimization. You’re paying 2x infrastructure cost for the 0.1% of the time a region fails. The math only works if that 0.1% would cost you more than the infrastructure premium.
Active-active eliminates the “untested failover” problem. If both regions serve real traffic daily, you know they work. Cold standbys rot.
Data replication lag is your RPO. Aurora Global Database gives you sub-second RPO. DynamoDB Global Tables give you eventual consistency with last-writer-wins conflict resolution. Choose based on your tolerance for data loss.
DNS TTL is your floor for RTO. No matter how fast your failover automation runs, clients with cached DNS records won’t see the change until TTL expires. Keep Route 53 TTLs at 60 seconds or less for failover records.
Failover automation must be tested regularly. Netflix has described running “region evacuation” exercises as a routine practice. If your failover hasn’t been tested in the last 30 days, assume it doesn’t work.
Cache warming is the hidden gotcha. A failover that redirects traffic to a region with an empty cache just moves the outage from “region down” to “database overwhelmed.” Pre-warm or accept the cold-start penalty.
Global load balancing is the entry point. Without Route 53 health checks or Global Accelerator, nothing else matters - you have no mechanism to redirect traffic away from a failed region.
Cost scales linearly, complexity scales exponentially. Going from 1 region to 2 regions doesn’t double your engineering effort - it quadruples it due to data consistency, deployment coordination, and testing requirements.
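The insurance framing in the first takeaway can be turned into a quick break-even check. This is only a sketch: it assumes multi-region removes the whole outage cost, which overstates the benefit, and the infrastructure spend in the example is a made-up figure.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def expected_annual_outage_cost(unavailability: float,
                                cost_per_minute: float) -> float:
    """Expected yearly loss if the region is down `unavailability` of the time."""
    return unavailability * MINUTES_PER_YEAR * cost_per_minute


def multi_region_pays_off(unavailability: float,
                          cost_per_minute: float,
                          annual_infra_cost: float,
                          cost_multiplier: float = 2.0) -> bool:
    """True when the expected outage loss exceeds the extra infrastructure spend.

    Simplification: treats multi-region as eliminating the outage cost entirely.
    """
    premium = annual_infra_cost * (cost_multiplier - 1.0)
    return expected_annual_outage_cost(unavailability, cost_per_minute) > premium


if __name__ == "__main__":
    # 0.1% regional unavailability at $47K per minute of downtime.
    loss = expected_annual_outage_cost(0.001, 47_000)
    print(f"expected annual outage cost: ${loss:,.0f}")
    # Hypothetical $5M/year single-region bill, doubled for active-active.
    print(multi_region_pays_off(0.001, 47_000, annual_infra_cost=5_000_000))
```

The same function returns False for a product that loses $10 per minute of downtime, which is the point of the takeaway: the answer depends on your numbers, not on a general rule.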

Multi-region isn’t something you bolt on during an outage at 2 AM. It’s a foundational architecture decision that shapes how you build services, how you handle data, and how you deploy code. The teams that survive regional outages planned for them months before they happened.

FAQ

Q: Can’t I just use multiple Availability Zones instead of multiple regions?

AZs share a regional control plane. When the region’s control plane fails (as happened in the 2021 us-east-1 outage), all AZs in that region are affected. Multi-AZ protects against single data center failures - hardware issues, power outages, local networking problems. Multi-region protects against the entire regional infrastructure going down.

Q: How do I handle database writes in an active-active setup?

Two approaches: single-writer (all writes route to one region’s database, with async replication to the secondary) or multi-writer (DynamoDB Global Tables, CockroachDB, or application-level conflict resolution). Single-writer is simpler but adds write latency for users far from the primary. Multi-writer eliminates that latency but introduces conflict resolution complexity.

Q: What’s the minimum viable multi-region setup for a startup?

Route 53 failover routing + Aurora Global Database + identical ECS services in two regions. You can start with active-passive (secondary region at minimal capacity) and graduate to active-active once traffic justifies the cost. Budget roughly 1.5x your current infrastructure spend for the passive approach.

Q: How do I handle in-flight requests during failover?

They fail. That’s the reality. Any request in flight when the region goes down will time out. Your clients need retry logic with exponential backoff. The goal isn’t zero dropped requests - it’s minimizing the window where requests fail. With Route 53 failover and 60-second TTLs, that window is 30-90 seconds for most clients.
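A minimal sketch of client-side retry with exponential backoff and full jitter follows. The function names and default values are illustrative assumptions, and the retried operation should be idempotent (or carry an idempotency key) so that a retry cannot apply the same change twice.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(attempts: int, base: float = 0.5, cap: float = 30.0) -> List[float]:
    """Upper bound of the wait before each retry: base * 2^n, capped at `cap`."""
    return [min(cap, base * (2 ** n)) for n in range(attempts)]


def call_with_retries(operation: Callable[[], T],
                      attempts: int = 6,
                      base: float = 0.5,
                      cap: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Callable[[], float] = random.random) -> T:
    """Call `operation`, retrying on connection errors and timeouts.

    Each wait is a random fraction of the exponential ceiling (full jitter),
    so that clients do not all retry at the same instant after a failover.
    """
    ceilings = backoff_delays(attempts, base, cap)
    for attempt in range(attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(rng() * ceilings[attempt])
    raise RuntimeError("attempts must be at least 1")


if __name__ == "__main__":
    print(backoff_delays(8))
```

The jitter matters during a regional failover in particular: thousands of clients fail at the same moment, and without randomization they would all hit the surviving region in synchronized waves.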

Q: Does multi-region mean my deployments are twice as complex?

Yes, but CI/CD automation absorbs most of that complexity. Deploy to both regions in parallel (not sequentially). Use feature flags to control rollout. If a deployment breaks one region, the other continues serving traffic while you roll back. Multi-region actually gives you safer deployments if you do canary releases region-by-region.

Q: How does data replication lag affect user experience?

If a user writes data in us-east-1 and immediately reads from us-west-2, they might not see their own write. This is the “read-your-own-writes” consistency problem. Solutions include sticky sessions (route a user to the same region consistently), read-from-primary for critical paths, or versioned reads where the client includes the last-known version in read requests.
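The read-from-primary option can be limited to the users who actually need it. Below is a minimal sketch that pins a user's reads to the primary for a short window after they write. The class name and the 2-second window are assumptions, and the in-memory dictionary stands in for state that would normally travel with the user (for example in a cookie or session token) so that every region can see it.

```python
import time
from typing import Callable, Dict


class ReadRouter:
    """Send a user's reads to the primary shortly after that user writes."""

    def __init__(self, pin_seconds: float = 2.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        # pin_seconds should comfortably exceed the observed replication lag.
        self._pin_seconds = pin_seconds
        self._clock = clock
        self._last_write: Dict[str, float] = {}

    def record_write(self, user_id: str) -> None:
        """Remember when this user last wrote."""
        self._last_write[user_id] = self._clock()

    def choose_target(self, user_id: str) -> str:
        """Return 'primary' while the user's write may still be replicating."""
        last = self._last_write.get(user_id)
        if last is not None and self._clock() - last < self._pin_seconds:
            return "primary"
        return "local-replica"


if __name__ == "__main__":
    router = ReadRouter()
    router.record_write("user-42")
    print(router.choose_target("user-42"))  # recent writer
    print(router.choose_target("user-7"))   # has not written
```

Only recent writers pay the cross-region read latency; everyone else keeps reading locally, which preserves most of the benefit of regional read replicas.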

Interview Questions

Q: Design a multi-region failover system for a payment processing platform. What are your RTO and RPO requirements, and how do they influence your architecture?

Expected depth: Discuss why payments require near-zero RPO (you can’t lose transaction records), how this pushes toward synchronous replication or event sourcing with replay capability. Cover idempotency keys for preventing double-charges during failover. Mention that RTO for payments is often contractually bound (SLA with merchants) and typically needs to be under 60 seconds.

Q: How would you handle the “split-brain” scenario where both regions believe they are the primary writer?

Expected depth: Explain that split-brain occurs when regions lose connectivity but both remain operational. Discuss fencing tokens, leader election via a third-party consensus service (e.g., DynamoDB lock table in a third region), and why “fail-closed” (refuse writes) is often safer than “fail-open” (accept writes and reconcile later) for financial systems.
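A strong answer usually includes a concrete picture of fencing. Here is a minimal in-memory sketch: `LeaseService` stands in for the third-region consensus service and `FencedStore` for the storage layer. Both classes are hypothetical illustrations of the idea, not a real library.

```python
from typing import Dict, Optional


class StaleWriterError(Exception):
    """Raised when a write carries a fencing token older than one already seen."""


class LeaseService:
    """Stand-in for a consensus service that grants the writer lease.

    Every grant returns a strictly larger token than the previous one.
    """

    def __init__(self) -> None:
        self._token = 0

    def acquire(self, region: str) -> int:
        self._token += 1
        return self._token


class FencedStore:
    """Storage that refuses writes from a writer whose lease was superseded."""

    def __init__(self) -> None:
        self._highest_token = 0
        self._data: Dict[str, str] = {}

    def write(self, token: int, key: str, value: str) -> None:
        if token < self._highest_token:
            raise StaleWriterError(
                f"token {token} is older than {self._highest_token}")
        self._highest_token = token
        self._data[key] = value

    def read(self, key: str) -> Optional[str]:
        return self._data.get(key)


if __name__ == "__main__":
    leases, store = LeaseService(), FencedStore()
    east = leases.acquire("us-east-1")
    store.write(east, "balance", "100")
    west = leases.acquire("us-west-2")   # failover: west takes the lease
    store.write(west, "balance", "90")
    try:
        store.write(east, "balance", "110")  # old primary comes back
    except StaleWriterError as exc:
        print("rejected:", exc)
```

The old primary does not need to know it was deposed. The storage layer enforces the ordering, which is what makes fencing safer than relying on each region's own view of who is the leader.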

Q: Compare active-active vs active-passive for a social media feed. Which would you recommend and why?

Expected depth: Social media feeds are read-heavy and eventually-consistent. Active-active with local reads is ideal because read latency matters more than strict consistency. Discuss how a user seeing a 1-second-stale feed is acceptable, but a payment being 1-second-stale is not. Cover the write-routing problem for posts (single-writer vs conflict resolution).

Q: Your multi-region system has a data replication lag of 800ms during peak hours. A user writes a comment in Region A and immediately refreshes in Region B. They don’t see their comment. How do you solve this without making all reads go to the primary?

Expected depth: Discuss session affinity (route same user to same region), causal consistency tokens (client sends write-timestamp, reader waits for replication to catch up), read-your-own-writes middleware (check primary only for the user who just wrote), and the tradeoff between consistency and latency at each solution.

Q: Walk me through a regional failover drill. What do you test, how do you minimize blast radius, and what metrics tell you the drill succeeded?

Expected depth: Cover synthetic traffic injection, percentage-based traffic shifting (start by routing 1% away, then 10%, then 100%), monitoring for error rate spikes during the shift, database promotion timing, cache hit rate in the secondary region, and the criteria for declaring the drill successful (error rate stays below SLA threshold, latency stays within bounds, no data loss detected post-drill).
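The staged shift with an abort condition can be sketched as a small control loop. The two callbacks are assumptions standing in for whatever adjusts routing weights and reads the error-rate metric in a given environment.

```python
from typing import Callable, Iterable, List, Tuple


def run_traffic_shift(stages: Iterable[int],
                      set_shift_percent: Callable[[int], None],
                      observe_error_rate: Callable[[], float],
                      max_error_rate: float) -> Tuple[bool, List[int]]:
    """Shift traffic away from a region in stages, rolling back on errors.

    Returns (succeeded, stages_completed). When the error rate observed
    after a stage exceeds max_error_rate, the shift is reset to 0% and
    the drill stops.
    """
    completed: List[int] = []
    for percent in stages:
        set_shift_percent(percent)
        if observe_error_rate() > max_error_rate:
            set_shift_percent(0)  # roll back to limit the blast radius
            return False, completed
        completed.append(percent)
    return True, completed


if __name__ == "__main__":
    # Fake callbacks for illustration: the third stage breaches a 1% threshold.
    rates = iter([0.001, 0.002, 0.08])
    ok, done = run_traffic_shift(
        stages=[1, 10, 100],
        set_shift_percent=lambda p: print(f"shifting {p}% of traffic"),
        observe_error_rate=lambda: next(rates),
        max_error_rate=0.01,
    )
    print("drill succeeded" if ok else f"aborted after stages {done}")
```

In a real drill each stage would also hold for a soak period and check latency and cache hit rate, but the shape is the same: small step, measure, then either continue or roll back.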
