The Region That Went Dark
When your entire stack lives in one AWS region and that region decides to take the night off
It’s 2:14 AM on a Tuesday. Your phone buzzes with the particular vibration pattern that means PagerDuty, not a text from a friend. You fumble it off the nightstand. CRITICAL: All health checks failing. Error rate: 100%. Revenue impact: $47K/min. You open your laptop and try to hit the dashboard. Nothing loads. You try the API directly. Connection timeout. You check Twitter - it’s already trending: “AWS us-east-1 outage.”
Your entire company - every service, every database, every queue, every byte of customer data - lives in a single AWS region. And that region just went dark.
This isn’t a hypothetical. On November 25, 2020, AWS us-east-1 experienced a major outage that took down a significant portion of the internet. Slack, Roku, Adobe, Flickr, Coinbase - all offline. On December 7, 2021, it happened again. The teams that recovered in minutes had one thing in common: they weren’t depending on a single region. The teams that were down for hours had something else in common: they’d been meaning to fix that.
You open your infrastructure diagram. It’s a single box labeled “us-east-1” with everything inside it. Load balancer, app servers, RDS database, ElastiCache, SQS queues, S3 buckets. The architecture that was “good enough for now” six months ago is now costing the company $47,000 every minute you sit here in the dark.
This is the multi-region availability problem - and the gap between knowing you need it and actually implementing it is where outages live.
Why This Happens
The default path on every cloud provider leads to single-region deployments. When you spin up an EC2 instance, it goes in one region. When you create an RDS database, it lives in one region. When you follow the “getting started” tutorial, everything lands in us-east-1 because it’s the default, it has the most services available, and it’s where all the examples point.
Single-region architectures aren’t a mistake - they’re a gravitational inevitability. A second region roughly doubles your infrastructure cost, adds data consistency complexity, and introduces failure modes that didn’t exist before. Teams avoid multi-region because the engineering cost is real and the outage probability feels theoretical. Until it isn’t.
The failure chain is predictable:
AWS region experiences control plane issue
→ EC2 instances become unreachable
→ ALB health checks fail, no healthy targets
→ DNS still points to dead load balancer
→ All requests time out (no alternative target)
→ 100% of users experience outage
→ Recovery depends entirely on AWS fixing the issue
→ You have zero control over your RTO
The core issue isn’t that regions fail - it’s that a single-region architecture has zero redundancy at the highest level of the infrastructure hierarchy. You might have three Availability Zones within that region, giving you redundancy against a single data center failure. But AZs share control planes, networking fabric, and regional services. When the region goes, the AZs go with it.
Core Insight
Availability Zones protect you from localized hardware failures. They do not protect you from regional control plane outages, regional networking issues, or regional service degradations. Multi-AZ is not multi-region.
The Naive Solution
The first thing most teams reach for after a regional outage is active-passive failover - a cold standby region that sits idle until needed. The reasoning is straightforward: keep a copy of everything in us-west-2, and if us-east-1 goes down, flip traffic over.
The problem with cold standby is that it’s cold. Your secondary region hasn’t served real traffic in months. The database replica exists, but has it been tested? Are the ECS task definitions up to date? Did someone remember to deploy last Thursday’s config change to the standby region? The AMIs might be three versions behind. The IAM roles might reference resources that no longer exist.
Here’s where active-passive breaks down at scale:
Failover initiated at t=0
→ DNS TTL propagation: 60-300 seconds (depending on client caching)
→ DB promotion from read replica to primary: 30-60 seconds
→ Cold containers need to scale from 0 to handle full traffic: 3-8 minutes
→ Cache is completely empty (cold start): 15+ minutes of DB pressure
→ Configuration drift discovered: manual intervention needed
→ Actual RTO: 15-45 minutes (not the 5 minutes in the runbook)
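As a rough cross-check on that timeline, the phase estimates can simply be added up. This is a minimal sketch that assumes the phases run strictly one after another (in practice some overlap) and treats the “15+ minutes” cache figure as exactly 15 minutes. The phase names are illustrative labels, not part of any tool.

```python
# Phase durations in seconds as (best_case, worst_case), copied from the timeline above.
PHASES = {
    "dns_propagation": (60, 300),
    "db_promotion": (30, 60),
    "container_scale_up": (180, 480),
    "cache_cold_start": (900, 900),  # assumption: "15+ minutes" treated as a flat 15
}


def rto_range_minutes(phases):
    """Sum best-case and worst-case durations, assuming no phases overlap."""
    best = sum(low for low, _ in phases.values())
    worst = sum(high for _, high in phases.values())
    return best / 60, worst / 60


best, worst = rto_range_minutes(PHASES)
print(f"Estimated RTO: {best:.1f} to {worst:.1f} minutes")
```

Under these assumptions the sum lands between roughly 20 and 29 minutes before any time is spent on configuration drift, which sits inside the 15-45 minute range above.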
The scale breakpoint is confidence. At low complexity - a stateless API with a single database - active-passive works fine. Once you have stateful services, event queues with in-flight messages, cache dependencies, and cross-service communication, the “just flip it over” fantasy collapses under the weight of untested assumptions.
Warning
Active-passive failover has a dirty secret: the failover itself is the riskiest operation you’ll perform. If you’ve never tested it under real load, your first test will be during the actual outage. Netflix calls untested failover plans “recovery theater.”
The Better Solution
The answer is active-active multi-region - both regions serve real production traffic all the time. No cold standby. No untested failover paths. Every region is battle-hardened by continuous real traffic.
This requires solving three distinct layers: global traffic routing, stateless service replication, and data synchronization.
Layer 1: Global Load Balancing
Traffic must route to the nearest healthy region automatically, without human intervention. Route 53 offers two routing policies for this: failover routing, shown first, and latency-based routing.
# Terraform - Route 53 failover routing
resource "aws_route53_health_check" "primary" {
  fqdn              = "api-east.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 10

  tags = {
    Name = "primary-region-health-check"
  }
}

resource "aws_route53_record" "api" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"

  alias {
    name                   = aws_lb.primary.dns_name
    zone_id                = aws_lb.primary.zone_id
    evaluate_target_health = true
  }

  failover_routing_policy {
    type = "PRIMARY"
  }

  health_check_id = aws_route53_health_check.primary.id
  set_identifier  = "primary"
}

resource "aws_route53_record" "api_secondary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"

  alias {
    name                   = aws_lb.secondary.dns_name
    zone_id                = aws_lb.secondary.zone_id
    evaluate_target_health = true
  }

  failover_routing_policy {
    type = "SECONDARY"
  }

  set_identifier = "secondary"
}
For true active-active (both regions handling traffic simultaneously), use latency-based routing instead:
# Latency-based routing - traffic goes to nearest healthy region
resource "aws_route53_record" "api_east" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"

  alias {
    name                   = aws_lb.east.dns_name
    zone_id                = aws_lb.east.zone_id
    evaluate_target_health = true
  }

  latency_routing_policy {
    region = "us-east-1"
  }

  health_check_id = aws_route53_health_check.east.id
  set_identifier  = "east"
}

# A matching record for us-west-2 (its own ALB, health check, and
# set_identifier) completes the pair.
Real-World
Netflix uses a combination of Route 53 and their own Zuul gateway to route traffic across three AWS regions (us-east-1, us-west-2, eu-west-1). The company has publicly described “region evacuation”: when one region is impaired, its traffic is shifted to the remaining two, which are provisioned to absorb it.
Layer 2: Stateless Service Replication
Your application servers need to run identically in both regions. This is the easiest layer if your services are truly stateless:
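# ECS Service definition - deployed to both regions via CI/CD
# aws-ecs-service.yml
# ECSCluster, ApiTaskDefinition, and ApiTargetGroup are defined elsewhere in the template.
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  ApiService:
    Type: AWS::ECS::Service
    Properties:
      Cluster: !Ref ECSCluster
      DesiredCount: 4
      TaskDefinition: !Ref ApiTaskDefinition
      LoadBalancers:
        - ContainerName: api
          ContainerPort: 8080
          TargetGroupArn: !Ref ApiTargetGroup
      DeploymentConfiguration:
        MinimumHealthyPercent: 75
        MaximumPercent: 200

  # Auto-scaling to absorb failover traffic.
  # Scaling limits are a separate resource, not a property of the ECS service;
  # a scaling policy attached to this target is still needed to trigger scaling.
  ServiceScalingTarget:
    Type: AWS::ApplicationAutoScaling::ScalableTarget
    Properties:
      ServiceNamespace: ecs
      ScalableDimension: ecs:service:DesiredCount
      ResourceId: !Sub service/${ECSCluster}/${ApiService.Name}
      MinCapacity: 4
      MaxCapacity: 20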
The key detail: each region must have enough spare capacity to absorb the other region’s traffic during failover. If both regions run at 50% capacity during normal operation, either can handle 100% during a failure. This is the “N+1 at the region level” principle.
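The headroom rule generalizes beyond two regions. The sketch below is an illustration only: it assumes traffic is spread evenly across regions and that capacity is measured in interchangeable tasks. The function names are made up for this example.

```python
import math


def tasks_per_region(peak_tasks_total: int, regions: int, tolerated_failures: int = 1) -> int:
    """Tasks each region must be able to run so the survivors can carry peak load."""
    surviving = regions - tolerated_failures
    if surviving < 1:
        raise ValueError("at least one region must survive")
    return math.ceil(peak_tasks_total / surviving)


def steady_state_utilization(regions: int, tolerated_failures: int = 1) -> float:
    """Fraction of provisioned capacity in use when every region is healthy."""
    return (regions - tolerated_failures) / regions


# Two regions, 8 tasks needed at peak: each region must be able to run all 8.
print(tasks_per_region(8, regions=2))   # 8
print(steady_state_utilization(2))      # 0.5, the "50% capacity" figure above
# Three regions need less idle headroom per region.
print(tasks_per_region(9, regions=3))   # 5
```

With three regions, steady-state utilization rises to about two thirds, which is one reason a third region can be cheaper per unit of resilience than the jump from one region to two.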
Layer 3: Data Replication
This is where multi-region gets hard. Stateless services are easy to replicate. Data is not.
RTO (Recovery Time Objective) is how long you can be down. RPO (Recovery Point Objective) is how much data you can afford to lose. These two numbers determine your replication strategy.
# Aurora Global Database - async replication, sub-second lag
resource "aws_rds_global_cluster" "main" {
  global_cluster_identifier = "example-global-cluster"
  engine                    = "aurora-postgresql"
  engine_version            = "14.5"
  database_name             = "app_production"
}

resource "aws_rds_cluster" "primary" {
  cluster_identifier        = "example-cluster-primary"
  engine                    = "aurora-postgresql"
  engine_version            = "14.5"
  global_cluster_identifier = aws_rds_global_cluster.main.id
  # "admin" is a reserved name on PostgreSQL engines, so use something else
  master_username           = "app_admin"
  master_password           = var.db_password
  availability_zones        = ["us-east-1a", "us-east-1b", "us-east-1c"]
}

resource "aws_rds_cluster" "secondary" {
  provider                  = aws.west
  cluster_identifier        = "example-cluster-secondary"
  engine                    = "aurora-postgresql"
  engine_version            = "14.5"
  global_cluster_identifier = aws_rds_global_cluster.main.id
  availability_zones        = ["us-west-2a", "us-west-2b", "us-west-2c"]

  # This cluster starts as a reader - promoted on failover
  depends_on = [aws_rds_cluster.primary]
}
Data replication lag is the gap between what’s written in the primary region and what’s available in the secondary. Aurora Global Database typically maintains under 1 second of lag, meaning your RPO is roughly 1 second of transactions during an unplanned failover.
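To turn lag into something a product owner can reason about, multiply it by the write rate. The helper below is a back-of-the-envelope sketch; the function names and the sample numbers are assumptions for illustration, not measurements.

```python
def writes_at_risk(replication_lag_ms: int, writes_per_second: int) -> float:
    """Approximate count of committed writes not yet visible in the secondary."""
    return replication_lag_ms * writes_per_second / 1000


def within_rpo(replication_lag_ms: int, rpo_budget_ms: int) -> bool:
    """True while observed lag still fits the agreed RPO."""
    return replication_lag_ms <= rpo_budget_ms


# Hypothetical workload: 800 ms of lag at 500 writes per second.
print(writes_at_risk(800, 500))            # 400.0 writes could be lost
print(within_rpo(800, rpo_budget_ms=1000))  # True
```

Alerting on the lag metric against the RPO budget, rather than only on replication being “up”, is what keeps the stated RPO honest during peak load.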
Real-World
DynamoDB Global Tables provide active-active replication with eventual consistency. Writes in any region typically propagate to the other regions within about a second. One common use is storing idempotency keys, so that a payment processed in us-east-1 is recognized in us-west-2 after failover instead of being processed again. Because replication is asynchronous, a retry that lands in the other region inside the replication window can still miss the key.
The Full Architecture
The happy path works like this: a user in New York hits api.example.com. Route 53 resolves this to the us-east-1 ALB based on latency routing. The request hits an ECS container, which reads from the local Aurora reader and ElastiCache instance. Writes go to the Aurora primary writer in us-east-1. Aurora replicates those writes to us-west-2 asynchronously.
A user in San Francisco hits the same domain. Route 53 routes them to us-west-2. Their reads are served locally. Their writes still go to the Aurora primary in us-east-1 (unless you’re using DynamoDB Global Tables, which accept writes in any region).
When us-east-1 fails: Route 53 health checks detect the failure within 30 seconds. DNS updates route all traffic to us-west-2. The Aurora secondary cluster promotes to primary (roughly 30 seconds). West coast users see no interruption. East coast users experience 30-90 seconds of errors during DNS propagation, then resume normally.
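The 30-90 second window falls out of the health check settings used earlier (10-second interval, failure threshold of 3) plus a 60-second DNS TTL. The sketch below is a simplified model: it ignores resolvers that disregard TTLs and the time Aurora needs to promote, and the function name is hypothetical.

```python
def client_error_window(request_interval_s: int, failure_threshold: int, dns_ttl_s: int):
    """Return (detection_time, worst_case_window) in seconds.

    Detection takes `failure_threshold` consecutive failed checks. A client
    that cached the old record just before the switch waits a full TTL more.
    """
    detection = request_interval_s * failure_threshold
    return detection, detection + dns_ttl_s


detection, worst_case = client_error_window(
    request_interval_s=10, failure_threshold=3, dns_ttl_s=60
)
print(detection, worst_case)  # 30 90
```

Lowering the failure threshold shortens detection but makes the system more likely to fail over on a transient blip, so the two numbers have to be tuned together.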
Component Deep Dives
Failover Automation
Manual failover at 2 AM is a disaster recipe. The automation needs to handle detection, decision, and execution without human input:
# Lambda function triggered by CloudWatch alarm
import time
from datetime import datetime, timedelta, timezone

import boto3


def handler(event, context):
    """Automated failover orchestrator."""
    rds = boto3.client('rds', region_name='us-west-2')

    # Step 1: Verify the outage is real (avoid flapping)
    if not verify_sustained_failure(event):
        return {"status": "transient", "action": "none"}

    # Step 2: Promote Aurora secondary to primary.
    # AllowDataLoss=True requests an unplanned failover; without it the call
    # defaults to a switchover, which needs a healthy primary.
    rds.failover_global_cluster(
        GlobalClusterIdentifier='example-global-cluster',
        TargetDbClusterIdentifier='arn:aws:rds:us-west-2:123456789012:cluster:example-cluster-secondary',
        AllowDataLoss=True
    )

    # Step 3: Wait for promotion to complete
    waiter = rds.get_waiter('db_cluster_available')
    waiter.wait(DBClusterIdentifier='example-cluster-secondary')

    # Step 4: Update Route 53 if not using automatic failover
    # (If using failover routing policy, this happens automatically)

    # Step 5: Scale up secondary region
    ecs = boto3.client('ecs', region_name='us-west-2')
    ecs.update_service(
        cluster='production',
        service='api-service',
        desiredCount=20  # absorb full traffic
    )

    return {"status": "failover_complete", "timestamp": time.time()}


def verify_sustained_failure(event):
    """Require the 3 most recent data points to be failures before triggering failover."""
    # Route 53 health check metrics are published only in us-east-1, so this
    # query can itself fail while us-east-1 is impaired.
    cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')
    now = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_data(
        MetricDataQueries=[{
            'Id': 'health',
            'MetricStat': {
                'Metric': {
                    'Namespace': 'AWS/Route53',
                    'MetricName': 'HealthCheckStatus',
                    'Dimensions': [{'Name': 'HealthCheckId', 'Value': event['health_check_id']}]
                },
                'Period': 60,  # these metrics have one-minute resolution
                'Stat': 'Minimum'
            }
        }],
        StartTime=now - timedelta(minutes=5),
        EndTime=now,
        ScanBy='TimestampDescending'  # newest data point first
    )
    latest = response['MetricDataResults'][0]['Values'][:3]
    return len(latest) == 3 and all(v == 0 for v in latest)
Cross-Region Cache Warming
An empty cache in the failover region means your database gets hammered with the full request volume immediately after failover. Pre-warming solves this:
# Cache warming service running in secondary region
import json
import time

import boto3
import redis


class CacheWarmer:
    def __init__(self, primary_region='us-east-1', secondary_region='us-west-2'):
        # The stream lives in the primary region; warming happens while it is healthy.
        self.kinesis = boto3.client('kinesis', region_name=primary_region)
        self.local_cache = redis.Redis(host='elasticache.us-west-2.amazonaws.com')

    def process_replication_stream(self):
        """Consume cache invalidation events and pre-warm locally."""
        shard_iterator = self.kinesis.get_shard_iterator(
            StreamName='cache-invalidation-stream',
            ShardId='shardId-000000000000',
            ShardIteratorType='LATEST'
        )['ShardIterator']

        # NextShardIterator is missing once a closed shard has been fully read.
        while shard_iterator:
            response = self.kinesis.get_records(
                ShardIterator=shard_iterator,
                Limit=100
            )
            for record in response['Records']:
                event = json.loads(record['Data'])
                self.local_cache.set(
                    event['key'],
                    event['value'],
                    ex=event['ttl']
                )
            shard_iterator = response.get('NextShardIterator')
            time.sleep(1)  # stay well under the per-shard GetRecords rate limit
Health Check Endpoint
The health check isn’t just “is the process running” - it verifies the full dependency chain:
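// Health check endpoint that validates full stack.
// Requires imports: context, encoding/json, net/http, time.
// checkDatabase, checkRedis, and checkSQS are dependency probes defined elsewhere.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	// Bound the total time spent probing dependencies.
	ctx, cancel := context.WithTimeout(r.Context(), 5*time.Second)
	defer cancel()

	checks := map[string]error{
		"database": checkDatabase(ctx),
		"cache":    checkRedis(ctx),
		"queue":    checkSQS(ctx),
	}

	allHealthy := true
	results := make(map[string]string)
	for name, err := range checks {
		if err != nil {
			allHealthy = false
			results[name] = err.Error()
		} else {
			results[name] = "ok"
		}
	}

	// Headers must be set before the status code is written.
	w.Header().Set("Content-Type", "application/json")
	if !allHealthy {
		w.WriteHeader(http.StatusServiceUnavailable)
	}
	json.NewEncoder(w).Encode(results)
}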
Design Insight
Your health check endpoint determines your failover sensitivity. Too shallow (just return 200) and you’ll route traffic to a region that’s technically running but can’t serve requests. Too deep (check every downstream dependency) and a single slow query triggers unnecessary failovers.
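One way to sit between “too shallow” and “too deep” is to split dependencies into critical and non-critical and fail the check only when a critical one is down. The following is a language-neutral sketch in Python rather than the handler’s Go. The function name, the `critical` set, and the “degraded” label are all assumptions made for illustration.

```python
from typing import Dict, Optional, Set, Tuple


def evaluate_health(
    checks: Dict[str, Optional[str]], critical: Set[str]
) -> Tuple[int, Dict[str, object]]:
    """Map dependency results to an HTTP status.

    `checks` maps a dependency name to an error message, or None when healthy.
    Only failures of dependencies listed in `critical` produce a 503.
    """
    details = {name: (err if err is not None else "ok") for name, err in checks.items()}
    critical_down = any(err is not None and name in critical for name, err in checks.items())
    any_down = any(err is not None for err in checks.values())

    if critical_down:
        return 503, {"overall": "unhealthy", "checks": details}
    if any_down:
        return 200, {"overall": "degraded", "checks": details}
    return 200, {"overall": "ok", "checks": details}


# A slow cache degrades the region but does not trigger failover.
status, body = evaluate_health({"database": None, "cache": "timeout"}, critical={"database"})
print(status, body["overall"])  # 200 degraded
```

Which dependencies count as critical is a product decision: a region that cannot reach its database cannot serve requests, while one with a cold cache is merely slow.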
Comparison Table
| Approach | RTO | RPO | Cost Multiplier | Complexity | Best For |
|---|---|---|---|---|---|
| Single Region (Multi-AZ) | Depends on AWS | 0 (synchronous within region) | 1x | Low | Non-critical apps, early startups |
| Active-Passive (Cold) | 15-45 min | 1-5 min (async repl) | 1.3x | Medium | Moderate SLAs, cost-sensitive |
| Active-Passive (Warm) | 5-15 min | Under 1 min | 1.6x | Medium-High | Business-critical with budget constraints |
| Active-Active (Read local, Write primary) | 30-90 sec | Under 1 sec | 2x | High | High-availability SaaS, fintech |
| Active-Active (Multi-writer) | Under 10 sec | ~0 (conflict resolution) | 2.2x+ | Very High | Global apps, gaming, real-time systems |
| Cell-Based Architecture | Under 5 sec (per cell) | 0 (isolated cells) | 2.5x+ | Extreme | Hyperscale (AWS, Azure internal) |
Key Takeaways
- Multi-region is insurance, not optimization. You’re paying 2x infrastructure cost for the 0.1% of the time a region fails. The math only works if that 0.1% would cost you more than the infrastructure premium.
- Active-active eliminates the “untested failover” problem. If both regions serve real traffic daily, you know they work. Cold standbys rot.
- Data replication lag is your RPO. Aurora Global Database gives you sub-second RPO. DynamoDB Global Tables give you eventual consistency with last-writer-wins conflict resolution. Choose based on your tolerance for data loss.
- DNS TTL is your floor for RTO. No matter how fast your failover automation runs, clients with cached DNS records won’t see the change until TTL expires. Keep Route 53 TTLs at 60 seconds or less for failover records.
- Failover automation must be tested regularly. Netflix runs “region evacuation” drills monthly. If your failover hasn’t been tested in the last 30 days, it doesn’t work.
- Cache warming is the hidden gotcha. A failover that redirects traffic to a region with an empty cache just moves the outage from “region down” to “database overwhelmed.” Pre-warm or accept the cold-start penalty.
- Global load balancing is the entry point. Without Route 53 health checks or Global Accelerator, nothing else matters - you have no mechanism to redirect traffic away from a failed region.
- Cost scales roughly linearly; complexity grows much faster. Going from 1 region to 2 regions doesn’t double your engineering effort - it quadruples it due to data consistency, deployment coordination, and testing requirements.
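The insurance framing in the first takeaway can be made concrete with a break-even check. This is a deliberately crude sketch: it assumes the second region removes the outage cost entirely and that downtime is spread evenly over the year. The function names and the $1M infrastructure figure are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def expected_outage_cost(downtime_fraction: float, revenue_per_minute: float) -> float:
    """Expected yearly revenue lost to regional downtime."""
    return downtime_fraction * MINUTES_PER_YEAR * revenue_per_minute


def multi_region_pays_off(
    single_region_annual_cost: float,
    cost_multiplier: float,
    downtime_fraction: float,
    revenue_per_minute: float,
) -> bool:
    """Compare the yearly infrastructure premium with the expected outage loss."""
    premium = single_region_annual_cost * (cost_multiplier - 1)
    return expected_outage_cost(downtime_fraction, revenue_per_minute) > premium


# The scenario from the introduction: $47K per minute, region down 0.1% of the time.
print(round(expected_outage_cost(0.001, 47_000)))              # about 24.7 million
print(multi_region_pays_off(1_000_000, 2.0, 0.001, 47_000))    # True
```

At $47K per minute the premium is trivially justified; at a few dollars per minute the same calculation says to stay single-region for now.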
Multi-region isn’t something you bolt on during an outage at 2 AM. It’s a foundational architecture decision that shapes how you build services, how you handle data, and how you deploy code. The teams that survive regional outages planned for them months before they happened.
FAQ
Q: Can’t I just use multiple Availability Zones instead of multiple regions?
AZs share a regional control plane. When the region’s control plane fails (as happened in the 2021 us-east-1 outage), all AZs in that region are affected. Multi-AZ protects against single data center failures - hardware issues, power outages, local networking problems. Multi-region protects against the entire regional infrastructure going down.
Q: How do I handle database writes in an active-active setup?
Two approaches: single-writer (all writes route to one region’s database, with async replication to the secondary) or multi-writer (DynamoDB Global Tables, CockroachDB, or application-level conflict resolution). Single-writer is simpler but adds write latency for users far from the primary. Multi-writer eliminates that latency but introduces conflict resolution complexity.
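To see what “conflict resolution complexity” means in practice, here is a minimal sketch of last-writer-wins, the simplest multi-writer policy. It is an illustration of the idea, not how any particular database implements it. The `Versioned` type and the region tie-break are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    updated_at_ms: int  # wall-clock timestamp of the write
    region: str         # used only to break exact timestamp ties


def last_writer_wins(a: Versioned, b: Versioned) -> Versioned:
    """Keep the version with the later timestamp; the other write is discarded."""
    return max(a, b, key=lambda v: (v.updated_at_ms, v.region))


east = Versioned("shipping=express", updated_at_ms=1_000, region="us-east-1")
west = Versioned("shipping=standard", updated_at_ms=1_001, region="us-west-2")
print(last_writer_wins(east, west).value)  # shipping=standard
```

The east-coast write is silently lost even though both users saw a success response, and the outcome depends on clock accuracy across regions. That is acceptable for a profile setting and unacceptable for an account balance.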
Q: What’s the minimum viable multi-region setup for a startup?
Route 53 failover routing + Aurora Global Database + identical ECS services in two regions. You can start with active-passive (secondary region at minimal capacity) and graduate to active-active once traffic justifies the cost. Budget roughly 1.5x your current infrastructure spend for the passive approach.
Q: How do I handle in-flight requests during failover?
They fail. That’s the reality. Any request in-flight when the region goes down will time out. Your clients need retry logic with exponential backoff. The goal isn’t zero dropped requests - it’s minimizing the window where requests fail. With Route 53 failover and 60-second TTLs, that window is 30-90 seconds for most clients.
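A minimal sketch of client-side retry with exponential backoff follows. It adds full jitter (a random delay up to the backoff ceiling), which is a common refinement rather than something this article prescribes. The `sleep` and `rng` parameters exist only so the behavior can be tested without waiting, and the set of retryable exceptions is an assumption.

```python
import random
import time

RETRYABLE = (ConnectionError, TimeoutError)


def call_with_retries(operation, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call `operation` until it succeeds or `max_attempts` is exhausted."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except RETRYABLE:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Ceiling doubles each attempt (0.5s, 1s, 2s, ...) up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * ceiling)  # full jitter avoids synchronized retry storms
```

Retries are only safe for idempotent operations. Anything that creates or charges should carry an idempotency key so that a retried request is not applied twice.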
Q: Does multi-region mean my deployments are twice as complex?
Yes, but CI/CD automation absorbs most of that complexity. Deploy to both regions in parallel (not sequentially). Use feature flags to control rollout. If a deployment breaks one region, the other continues serving traffic while you roll back. Multi-region actually gives you safer deployments if you do canary releases region-by-region.
Q: How does data replication lag affect user experience?
If a user writes data in us-east-1 and immediately reads from us-west-2, they might not see their own write. This is the “read-your-own-writes” consistency problem. Solutions include sticky sessions (route a user to the same region consistently), read-from-primary for critical paths, or versioned reads where the client includes the last-known version in read requests.
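The “read-from-primary for critical paths” option can be narrowed to just the users who wrote recently. The sketch below assumes the application records each user’s last write time (for example in the session) and that one second is a safe upper bound on replication lag. Both are assumptions, as is the function name.

```python
from typing import Optional


def choose_read_target(last_write_at: Optional[float], now: float,
                       lag_budget_s: float = 1.0) -> str:
    """Send a user's reads to the primary only while their last write may still be replicating."""
    if last_write_at is not None and (now - last_write_at) < lag_budget_s:
        return "primary"
    return "local"


print(choose_read_target(last_write_at=100.0, now=100.3))  # primary
print(choose_read_target(last_write_at=100.0, now=105.0))  # local
print(choose_read_target(last_write_at=None, now=105.0))   # local
```

Only the small fraction of requests that follow a fresh write pay the cross-region latency; everyone else keeps reading locally.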
Interview Questions
Q: Design a multi-region failover system for a payment processing platform. What are your RTO and RPO requirements, and how do they influence your architecture?
Expected depth: Discuss why payments require near-zero RPO (you can’t lose transaction records), how this pushes toward synchronous replication or event sourcing with replay capability. Cover idempotency keys for preventing double-charges during failover. Mention that RTO for payments is often contractually bound (SLA with merchants) and typically needs to be under 60 seconds.
Q: How would you handle the “split-brain” scenario where both regions believe they are the primary writer?
Expected depth: Explain that split-brain occurs when regions lose connectivity but both remain operational. Discuss fencing tokens, leader election via a third-party consensus service (e.g., DynamoDB lock table in a third region), and why “fail-closed” (refuse writes) is often safer than “fail-open” (accept writes and reconcile later) for financial systems.
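A fencing token is easiest to explain with a toy example. In this sketch, whichever service grants write leadership hands out a strictly increasing token, and the storage layer rejects any write carrying an older one. `FencedStore` and `StaleWriterError` are hypothetical names, and a real system would enforce the check inside the database or storage service.

```python
class StaleWriterError(Exception):
    """Raised when a writer presents a token older than one already seen."""


class FencedStore:
    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token: int, key: str, value: str) -> None:
        # A former primary that missed the failover still holds its old token.
        if token < self.highest_token:
            raise StaleWriterError(f"token {token} < {self.highest_token}")
        self.highest_token = token
        self.data[key] = value


store = FencedStore()
store.write(1, "balance", "100")      # us-east-1 is primary with token 1
store.write(2, "balance", "90")       # us-west-2 is promoted and gets token 2
try:
    store.write(1, "balance", "120")  # old primary wakes up and tries to write
except StaleWriterError as exc:
    print("rejected:", exc)
print(store.data["balance"])          # 90
```

This is the fail-closed behavior described above: the stale writer gets an error instead of silently overwriting newer data.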
Q: Compare active-active vs active-passive for a social media feed. Which would you recommend and why?
Expected depth: Social media feeds are read-heavy and eventually-consistent. Active-active with local reads is ideal because read latency matters more than strict consistency. Discuss how a user seeing a 1-second-stale feed is acceptable, but a payment being 1-second-stale is not. Cover the write-routing problem for posts (single-writer vs conflict resolution).
Q: Your multi-region system has a data replication lag of 800ms during peak hours. A user writes a comment in Region A and immediately refreshes in Region B. They don’t see their comment. How do you solve this without making all reads go to the primary?
Expected depth: Discuss session affinity (route same user to same region), causal consistency tokens (client sends write-timestamp, reader waits for replication to catch up), read-your-own-writes middleware (check primary only for the user who just wrote), and the tradeoff between consistency and latency at each solution.
Q: Walk me through a regional failover drill. What do you test, how do you minimize blast radius, and what metrics tell you the drill succeeded?
Expected depth: Cover synthetic traffic injection, percentage-based traffic shifting (start by routing 1% away, then 10%, then 100%), monitoring for error rate spikes during the shift, database promotion timing, cache hit rate in the secondary region, and the criteria for declaring the drill successful (error rate stays below SLA threshold, latency stays within bounds, no data loss detected post-drill).
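The percentage-based shift with an abort condition can be sketched as a small control loop. `observe_error_rate` stands in for whatever monitoring query a real drill would use. It and the result fields are assumptions for illustration.

```python
from typing import Callable, Dict, Sequence


def run_drill(steps: Sequence[int],
              observe_error_rate: Callable[[int], float],
              max_error_rate: float) -> Dict[str, object]:
    """Shift traffic away in stages; stop at the first stage that breaches the threshold."""
    last_good = 0
    for pct in steps:
        rate = observe_error_rate(pct)  # error rate measured after shifting pct% away
        if rate > max_error_rate:
            return {"succeeded": False, "rolled_back_from": pct, "last_good": last_good}
        last_good = pct
    return {"succeeded": True, "rolled_back_from": None, "last_good": last_good}


# Hypothetical drill in which the secondary region starts erroring at 10% shifted.
result = run_drill([1, 10, 100],
                   observe_error_rate=lambda pct: 0.05 if pct >= 10 else 0.0,
                   max_error_rate=0.01)
print(result)
```

Starting at 1% keeps the blast radius small: if the secondary region is misconfigured, only a sliver of users sees errors before the drill rolls back.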