Multi-Region Architecture: Building Systems That Survive Regional Failures
AWS US-East-1 goes down. It has happened before: in 2011, 2012, 2017, and 2021. If your entire application runs in one region, you are down too. Your users cannot access your service. Your revenue stops. Your SLA is violated.
Multi-region architecture distributes your system across multiple geographic regions. A regional failure affects only part of your system. The rest continues serving users.
Why multi-region
Disaster recovery: A regional failure (power outage, natural disaster, cloud provider incident) does not take down your entire service.
Lower latency: Serve users from the region closest to them. A user in Singapore might see roughly 5ms of network latency to a Singapore region instead of roughly 200ms to US-East.
Data residency: Some regulations (GDPR, data sovereignty laws) require data to be stored in specific geographic regions.
Compliance: Some industries require geographic redundancy for business continuity.
Multi-region patterns
Active-passive (primary-secondary)
One region is active (handles all traffic). Other regions are passive (standby, ready to take over). Traffic is routed to the active region. If the active region fails, traffic is switched to a passive region.
Failover time: Minutes to hours (depending on automation and data replication lag).
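Data consistency: Strong within the primary region (all writes go to the primary and are replicated to secondaries). With asynchronous replication, the secondaries can lag behind the primary.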
Cost: Lower (passive regions have minimal resources).
Use for: Applications where brief downtime is acceptable, data consistency is critical.
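The failover decision in active-passive can be sketched as a small controller that promotes the standby after several consecutive failed health checks. This is a minimal illustration, not a production design: the class name, the region names, and the threshold of 3 are assumptions, and a real system would also have to promote the standby database and update DNS.

```python
from dataclasses import dataclass


@dataclass
class FailoverController:
    """Tracks health of the active region and promotes the standby after repeated failures."""
    active: str
    standby: str
    failure_threshold: int = 3  # assumed: consecutive failed checks before failover
    consecutive_failures: int = 0

    def record_health_check(self, healthy: bool) -> str:
        """Record one health-check result for the active region.

        Returns the region that should receive traffic after this check.
        """
        if healthy:
            # A single success resets the counter, so brief blips do not trigger failover.
            self.consecutive_failures = 0
            return self.active
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            # Promote the standby; the old primary becomes the standby once it recovers.
            self.active, self.standby = self.standby, self.active
            self.consecutive_failures = 0
        return self.active


controller = FailoverController(active="us-east", standby="eu-west")
for result in [True, False, False, False]:
    target = controller.record_health_check(result)
print(target)  # eu-west
```

Requiring several consecutive failures trades a slower failover for fewer false positives, which is one reason failover time in this pattern is measured in minutes rather than seconds.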
graph TB
    subgraph active_passive["Active-Passive"]
        DNS1["DNS points to US-East"]
        US_E["US-East ACTIVE All traffic"]
        EU_W["EU-West PASSIVE Standby"]
        US_E -->|"async replication"| EU_W
        DNS1 --> US_E
        FAIL["US-East fails"] -->|"DNS failover 5-15 minutes"| EU_W
    end
    subgraph active_active["Active-Active"]
        DNS2["GeoDNS routes by location"]
        US_E2["US-East ACTIVE US users"]
        EU_W2["EU-West ACTIVE EU users"]
        DNS2 -->|"US users"| US_E2
        DNS2 -->|"EU users"| EU_W2
        US_E2 <-->|"async replication"| EU_W2
    end
    style US_E fill:#EEEDFE,stroke:#534AB7,color:#3C3489
    style EU_W fill:#F1EFE8,stroke:#888780,color:#444441
    style US_E2 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
    style EU_W2 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
    style FAIL fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
Active-active
Multiple regions are active simultaneously. Traffic is distributed across regions (by geography, by load, or both). Each region handles a subset of traffic.
Failover time: Seconds to a few minutes (bounded by the DNS TTL or the load balancer health-check interval).
Data consistency: Eventual (writes in one region must replicate to others).
Cost: Higher (all regions have full capacity).
Use for: Applications requiring high availability and low latency globally.
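Routing in active-active can be thought of as "nearest healthy region". The sketch below is only an illustration of that rule; the function name, the region names, and the EU-West latency figure are assumptions, while the 5ms and 200ms values echo the Singapore example earlier in the article.

```python
def pick_region(latency_ms: dict[str, float], healthy: set[str]) -> str:
    """Return the healthy region with the lowest measured latency for this user."""
    candidates = {region: ms for region, ms in latency_ms.items() if region in healthy}
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)


# Latencies as seen by a hypothetical user in Singapore.
singapore_user = {"ap-southeast": 5.0, "eu-west": 160.0, "us-east": 200.0}

# Normal operation: the user is served locally.
print(pick_region(singapore_user, {"ap-southeast", "eu-west", "us-east"}))  # ap-southeast

# The Singapore region fails: traffic shifts to the next-closest healthy region.
print(pick_region(singapore_user, {"eu-west", "us-east"}))  # eu-west
```

Because every region already serves traffic, failover here is just the same routing rule applied to a smaller set of healthy regions.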
Read-local, write-global
A hybrid approach. Reads are served from the local region (low latency). Writes go to a primary region (strong consistency). Reads might be slightly stale (replication lag).
Use for: Read-heavy applications where brief read staleness is acceptable (social media feeds, product catalogs).
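The read-local, write-global rule can be sketched as a routing function keyed on the HTTP method. This is a simplified illustration: the `Route` type, the function name, and the assumption that every non-write method is a safe read are hypothetical choices, not part of any specific framework.

```python
from dataclasses import dataclass

WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


@dataclass(frozen=True)
class Route:
    region: str
    may_be_stale: bool  # True when the data comes from a replica that can lag the primary


def route_request(method: str, local_region: str, primary_region: str) -> Route:
    """Send writes to the primary region and reads to the caller's local region."""
    if method.upper() in WRITE_METHODS:
        return Route(region=primary_region, may_be_stale=False)
    # Reads are served locally; they are only guaranteed fresh if local IS the primary.
    return Route(region=local_region, may_be_stale=local_region != primary_region)


print(route_request("GET", local_region="eu-west", primary_region="us-east"))
print(route_request("POST", local_region="eu-west", primary_region="us-east"))
```

Carrying a `may_be_stale` flag alongside the region makes the staleness trade-off visible to callers that need to decide whether a read is good enough.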
The data challenge
The hardest part of multi-region architecture is data. Compute is easy to replicate. Data is not.
Database replication across regions
Synchronous replication: Every write must be acknowledged by all regions before returning success. Strong consistency. High write latency (cross-region RTT is 50-200ms).
Asynchronous replication: Writes are acknowledged locally, replicated to other regions in the background. Low write latency. Eventual consistency. Risk of data loss if the primary region fails before replication completes.
Semi-synchronous: One remote region must acknowledge before success. Balance between latency and durability.
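The latency difference between the three modes comes down to which acknowledgements the writer waits for. The following sketch models that arithmetic under two assumptions: remote acknowledgements are requested in parallel, and one round trip per remote region is enough. The function name and the 2ms local commit time are illustrative.

```python
def write_latency_ms(mode: str, local_commit_ms: float, remote_rtt_ms: list[float]) -> float:
    """Estimate client-visible write latency for a replication mode.

    Assumes remote acknowledgements are requested in parallel.
    """
    if mode == "async" or not remote_rtt_ms:
        # Acknowledge as soon as the local commit is durable.
        return local_commit_ms
    if mode == "semi-sync":
        # Wait for the fastest remote region only.
        return local_commit_ms + min(remote_rtt_ms)
    if mode == "sync":
        # Wait for every remote region, so the slowest one dominates.
        return local_commit_ms + max(remote_rtt_ms)
    raise ValueError(f"unknown replication mode: {mode}")


remotes = [50.0, 200.0]  # the two ends of the cross-region RTT range quoted above
for mode in ("async", "semi-sync", "sync"):
    print(mode, write_latency_ms(mode, local_commit_ms=2.0, remote_rtt_ms=remotes))
# async 2.0
# semi-sync 52.0
# sync 202.0
```

The model also shows why semi-synchronous replication benefits from having at least one nearby remote region: its latency is set by the closest replica, not the farthest.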
Conflict resolution
In active-active, two regions might accept conflicting writes simultaneously. User A updates their profile in US-East. User B (or the same user on a different device) updates it in EU-West. Both writes succeed. When they replicate, you have a conflict.
Resolution strategies:
- Last-write-wins: The write with the higher timestamp wins. Simple, but the losing write is silently discarded, and clock skew between regions can pick the wrong winner.
- Application-level merge: Surface the conflict to the application. Complex but no data loss.
- CRDTs: Data structures that merge automatically. Works for specific data types (counters, sets).
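Two of these strategies are small enough to sketch directly. The code below shows a last-write-wins register and a grow-only counter (G-Counter), one of the simplest CRDTs. The class names and the use of the region name as a tie-breaker are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LwwRegister:
    """Last-write-wins: keep the value with the highest (timestamp, region) pair."""
    value: str
    timestamp: float
    region: str  # assumed tie-breaker so both regions pick the same winner

    def merge(self, other: "LwwRegister") -> "LwwRegister":
        return max(self, other, key=lambda r: (r.timestamp, r.region))


@dataclass
class GCounter:
    """Grow-only counter CRDT: each region increments only its own slot."""
    counts: dict[str, int] = field(default_factory=dict)

    def increment(self, region: str, amount: int = 1) -> None:
        self.counts[region] = self.counts.get(region, 0) + amount

    def merge(self, other: "GCounter") -> "GCounter":
        # Taking the per-region maximum makes merge commutative and idempotent.
        regions = self.counts.keys() | other.counts.keys()
        return GCounter({r: max(self.counts.get(r, 0), other.counts.get(r, 0)) for r in regions})

    @property
    def value(self) -> int:
        return sum(self.counts.values())


# Last-write-wins: the US-East update is lost.
us_write = LwwRegister("Alice", timestamp=100.0, region="us-east")
eu_write = LwwRegister("Alicia", timestamp=100.5, region="eu-west")
print(us_write.merge(eu_write).value)  # Alicia

# G-Counter: concurrent increments in both regions are all preserved.
us_counter, eu_counter = GCounter(), GCounter()
us_counter.increment("us-east", 3)
eu_counter.increment("eu-west", 2)
print(us_counter.merge(eu_counter).value)  # 5
```

The contrast is the point: the register converges by throwing one write away, while the counter converges without losing any increment, but only because its data type allows it.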
Global databases
Some databases handle multi-region replication natively:
- Google Spanner: Globally distributed SQL with external consistency. Writes are globally consistent. High write latency (100-200ms for global transactions).
- CockroachDB: Distributed SQL with multi-region support. Transactions are serializable; data locality is configurable per table (and per row), and survival goals are set per database.
- DynamoDB Global Tables: Multi-region active-active with eventual consistency.
- Cassandra: Multi-datacenter replication with tunable consistency.
graph LR
    subgraph data_patterns["Data Replication Patterns"]
        SYNC["Synchronous Strong consistency High write latency 100-200ms cross-region"]
        ASYNC["Asynchronous Eventual consistency Low write latency Risk of data loss"]
        SEMI["Semi-synchronous One remote ACK Balance of both"]
    end
    subgraph use2["Use For"]
        U1["Financial data Inventory User accounts"]
        U2["Activity feeds Product catalog Analytics"]
        U3["Most production use cases"]
    end
    SYNC --- U1
    ASYNC --- U2
    SEMI --- U3
    style SYNC fill:#EEEDFE,stroke:#534AB7,color:#3C3489
    style ASYNC fill:#E1F5EE,stroke:#0F6E56,color:#085041
    style SEMI fill:#FAEEDA,stroke:#854F0B,color:#633806
Where it breaks or gets interesting
DNS failover latency
DNS TTL determines how quickly traffic shifts during failover. A 5-minute TTL means it takes up to 5 minutes for all clients to see the new DNS record. During that window, some clients still go to the failed region.
Use low TTLs (60 seconds) for critical services. Use anycast routing (Cloudflare, AWS Global Accelerator) for near-instant failover.
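The effect of the TTL on failover time is simple arithmetic: the time to detect the failure plus the time for cached DNS answers to expire. The sketch below assumes a 10-second health-check interval and a threshold of 3 failed checks; both numbers and the function name are illustrative, and resolvers that ignore TTLs can stretch the real figure further.

```python
def worst_case_failover_seconds(check_interval_s: float, failure_threshold: int, dns_ttl_s: float) -> float:
    """Upper bound on how long a well-behaved client keeps hitting the failed region."""
    detection_s = check_interval_s * failure_threshold  # time to declare the region unhealthy
    return detection_s + dns_ttl_s  # plus time for cached DNS answers to expire


for ttl in (300, 60):
    total = worst_case_failover_seconds(check_interval_s=10, failure_threshold=3, dns_ttl_s=ttl)
    print(f"TTL {ttl}s -> up to {total}s")
# TTL 300s -> up to 330s
# TTL 60s -> up to 90s
```

Once the TTL is low, detection time becomes a meaningful share of the total, so tuning health checks matters as much as tuning DNS.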
The split-brain problem
In active-active, if the network between regions is severed, both regions might think the other is down and both try to be the primary. Both accept writes. When the network heals, you have conflicting data.
Prevention: use a consensus protocol for leader election (only one region can be primary at a time), or design the system to handle conflicts (CRDTs, last-write-wins).
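The core rule behind consensus-based leader election is that a region may act as primary only while it can reach a strict majority of voters. The sketch below shows just that rule, not a full consensus protocol. The function name and the three-voter layout (two regions plus a witness in a third location) are assumptions for illustration.

```python
def may_act_as_primary(reachable_voters: int, total_voters: int) -> bool:
    """A region may be primary only if it can reach a strict majority of voters.

    `reachable_voters` includes the region's own vote.
    """
    return reachable_voters > total_voters // 2


# Three voters: us-east, eu-west, and a witness in a third location.
# The link between us-east and eu-west is severed; only us-east still reaches the witness.
print(may_act_as_primary(reachable_voters=2, total_voters=3))  # True  (us-east + witness)
print(may_act_as_primary(reachable_voters=1, total_voters=3))  # False (eu-west alone)

# With only two voters, a partition leaves neither side with a majority.
print(may_act_as_primary(reachable_voters=1, total_voters=2))  # False
```

At most one side of a partition can hold a majority, which is what rules out two primaries. The two-voter case shows the cost of skipping the third location: the system stays consistent but neither region can accept writes.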
Partial failures
A region might be partially degraded (some services down, not all). Traffic routing must be smart enough to detect partial failures and route around them. Use health checks at the service level, not just the region level.
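Service-level health checking can be sketched as follows: a region is routable only if every service the request path depends on is healthy there. The function names, service names, and the idea of passing a set of critical services are assumptions for this example.

```python
def region_can_serve(service_health: dict[str, bool], critical: set[str]) -> bool:
    """A region is usable only if every critical service reports healthy.

    A service missing from the report is treated as unhealthy.
    """
    return all(service_health.get(name, False) for name in critical)


def routable_regions(regions: dict[str, dict[str, bool]], critical: set[str]) -> list[str]:
    """Return the regions that can serve requests depending on `critical` services."""
    return sorted(name for name, health in regions.items() if region_can_serve(health, critical))


health = {
    "us-east": {"api": True, "database": True, "search": False},  # partially degraded
    "eu-west": {"api": True, "database": True, "search": True},
}

# Checkout traffic does not need search, so both regions can take it.
print(routable_regions(health, critical={"api", "database"}))  # ['eu-west', 'us-east']

# Search traffic must avoid the degraded region.
print(routable_regions(health, critical={"api", "search"}))  # ['eu-west']
```

Evaluating health per request type avoids evacuating an entire region when only one of its services is down.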
Cross-region latency for synchronous operations
If your application makes synchronous cross-region calls (service in US-East calls database in EU-West), every request pays the cross-region latency penalty (100-200ms). Design to minimize cross-region synchronous calls. Use local caches, local read replicas, and async replication.
Real-world systems
Netflix - Active-active across multiple AWS regions. Uses Cassandra for multi-region data replication. Chaos Kong tests regional failover by simulating the loss of an entire region.
Amazon - Runs across multiple regions and availability zones. Services such as S3 and DynamoDB are regional by default and offer cross-region replication (S3 Cross-Region Replication, DynamoDB Global Tables) for multi-region operation.
Cloudflare - 300+ PoPs globally. Anycast routing sends users to the nearest PoP. Data is replicated globally.
Google - Spanner provides globally consistent transactions. Used for Google’s own services and available to customers.
Stripe - Active-passive with fast failover. Primary region handles all writes. Secondary region is a hot standby.
How to apply it in practice
Start with multi-AZ, then multi-region
Multi-AZ (multiple availability zones within one region) is simpler than multi-region and protects against most failures. Start there. Add multi-region when you need:
- Protection against regional failures
- Lower latency for global users
- Data residency requirements
The multi-region checklist
Before going multi-region:
- Stateless application layer: Application servers must be stateless (sessions in Redis, files in S3)
- Database replication: Choose a replication strategy (sync, async, semi-sync)
- Conflict resolution: Define how to handle conflicting writes
- DNS and traffic routing: GeoDNS or anycast for routing, low TTLs for failover
- Monitoring: Region-level health checks and dashboards
- Runbooks: Documented procedures for regional failover
- Testing: Regular failover drills (chaos engineering)
Cost considerations
Multi-region can double (or more) your infrastructure cost, and cross-region data transfer adds to the bill. Optimize:
- Use active-passive for non-critical services (passive regions have minimal resources)
- Use CDN for static content (cheaper than running full regions)
- Use spot/preemptible instances for stateless compute
- Replicate only the data that needs to be in each region
FAQ
Q: What is the difference between multi-region and multi-AZ?
Availability zones (AZs) are physically separate locations within a region, each made up of one or more data centers, connected by low-latency links. They protect against data center failures. Regions are geographically separate (different cities or countries). They protect against regional disasters (power grid failures, natural disasters, cloud provider regional incidents). Multi-AZ is simpler and cheaper. Multi-region is more complex but provides stronger guarantees.
Q: How do you handle user sessions in a multi-region active-active setup?
Store sessions in a distributed cache (Redis with multi-region replication, or DynamoDB Global Tables). When a user’s request is routed to a different region (due to failover or load balancing), the new region can read the session from the distributed cache. Alternatively, use JWTs (stateless tokens) that do not require a session store. The JWT contains the user’s identity and is verified locally by any region.
Q: What is RTO and RPO?
RTO (Recovery Time Objective) is the maximum acceptable time to restore service after a failure. RPO (Recovery Point Objective) is the maximum acceptable data loss (measured in time). For example: RTO = 1 hour (service must be restored within 1 hour), RPO = 5 minutes (at most 5 minutes of data can be lost). These objectives drive your architecture choices: lower RTO requires faster failover (active-active), lower RPO requires more frequent replication (synchronous or near-synchronous).
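One way to keep these objectives honest is to compare them against measurements from monitoring and failover drills. The sketch below uses the RTO and RPO values from the example above; the measured failover time and replication lag are made-up inputs, and the type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryObjectives:
    rto_seconds: float  # maximum acceptable time to restore service
    rpo_seconds: float  # maximum acceptable data loss, measured in time


def meets_objectives(
    objectives: RecoveryObjectives,
    measured_failover_seconds: float,
    replication_lag_seconds: float,
) -> dict[str, bool]:
    """Compare drill and monitoring measurements against the objectives.

    With asynchronous replication, the current replication lag approximates
    how much data would be lost if the primary failed right now.
    """
    return {
        "rto": measured_failover_seconds <= objectives.rto_seconds,
        "rpo": replication_lag_seconds <= objectives.rpo_seconds,
    }


objectives = RecoveryObjectives(rto_seconds=3600, rpo_seconds=300)  # RTO = 1 hour, RPO = 5 minutes
result = meets_objectives(objectives, measured_failover_seconds=900, replication_lag_seconds=420)
print(result)  # {'rto': True, 'rpo': False}
```

In this hypothetical result the failover is fast enough, but seven minutes of replication lag breaks a five-minute RPO, which points toward tighter replication rather than faster failover.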
Interview questions
Q1: Design a multi-region architecture for a payment processing system. What are the key considerations?
Strong answer: Payments require strong consistency (no double charges, no lost transactions). Use active-passive with synchronous replication for the database. The primary region handles all writes. The secondary region is a hot standby with synchronous replication (every write is acknowledged by both regions before returning success). This adds 100-200ms to write latency but guarantees no data loss. For reads, use the local region (low latency). For failover: use a consensus service (etcd) to manage primary election, with members in at least three failure domains (for example, a small witness in a third region) so that a majority survives the loss of either region. Failover is automated and takes 30-60 seconds. Use a global load balancer (AWS Global Accelerator) for near-instant traffic routing. RPO: 0 (synchronous replication, no data loss). RTO: 60 seconds (automated failover). The tradeoff: higher write latency (100-200ms) for strong consistency.
Q2: Your application is active-active across US-East and EU-West. A user updates their profile in US-East. 500ms later, they read their profile from EU-West (their request was routed there due to load balancing). They see the old profile. How do you fix this?
Strong answer: This is the read-your-writes consistency problem in a multi-region setup. Solutions: sticky routing (route all requests from a user to the same region for a short window after a write), replication position tokens (after a write, return the replication position; on subsequent reads, pass the token; the EU-West region checks if it has caught up to that position before serving the read), or synchronous replication for user profile writes (accept higher write latency for this specific data). The simplest solution for most applications: sticky routing with a 5-second window. After a write, set a cookie that routes the user to the same region for 5 seconds. After that, normal routing resumes. This handles the common case (user immediately sees their own update) without requiring synchronous replication.
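The replication position token approach can be sketched with a toy in-memory model. Everything here is an assumption made for illustration: the `Region` class, the integer positions, and the choice to fall back to the writing region when the local replica is behind (waiting for the replica to catch up is the other common option).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Region:
    name: str
    applied_position: int = 0  # how far this region has applied the replication log
    data: dict[str, str] = field(default_factory=dict)


def write(primary: Region, key: str, value: str) -> int:
    """Apply a write and return its log position as a token for the client."""
    primary.applied_position += 1
    primary.data[key] = value
    return primary.applied_position


def replicate(source: Region, target: Region) -> None:
    """Toy replication: copy everything and advance the target's position."""
    target.data = dict(source.data)
    target.applied_position = source.applied_position


def read(local: Region, fallback: Region, key: str, min_position: int = 0) -> Optional[str]:
    """Serve locally only if the local region has caught up to the client's token."""
    region = local if local.applied_position >= min_position else fallback
    return region.data.get(key)


us_east, eu_west = Region("us-east"), Region("eu-west")
write(us_east, "profile:1", "old bio")
replicate(us_east, eu_west)

token = write(us_east, "profile:1", "new bio")  # not yet replicated to eu-west

print(read(eu_west, us_east, "profile:1"))                      # old bio (no token: stale read)
print(read(eu_west, us_east, "profile:1", min_position=token))  # new bio (falls back to us-east)

replicate(us_east, eu_west)
print(read(eu_west, us_east, "profile:1", min_position=token))  # new bio (now served locally)
```

Only the user who made the write carries the token, so other users keep getting fast local reads while the writer is guaranteed to see their own update.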
Q3: How do you test multi-region failover without causing a real outage?
Strong answer: Use chaos engineering. Netflix’s Chaos Kong simulates the loss of an entire AWS region in production to verify that failover works. For most teams, start smaller: test failover in a staging environment that mirrors production. Simulate a regional failure by blocking all traffic to one region (using firewall rules or DNS changes). Verify that traffic automatically routes to the other region. Measure the failover time (RTO). Check that no data was lost (RPO). Run these tests regularly (monthly or quarterly). Gradually increase the scope: start with a single service, then a full region. Document the runbook for manual failover in case automation fails. The goal is to make failover a routine, practiced operation, not a panicked response to a real incident.
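Measuring failover time during a drill can be as simple as probing the service at a fixed interval and finding the longest gap between the first failed probe and the next successful one. The sketch below assumes probe results are collected as `(timestamp_seconds, succeeded)` pairs sorted by time; the function name and sample data are illustrative.

```python
def longest_outage_seconds(probes: list[tuple[float, bool]]) -> float:
    """Return the longest span from a first failed probe to the next successful probe.

    `probes` must be sorted by timestamp. An outage still in progress at the
    end of the list is not counted, because its duration is not yet known.
    """
    longest = 0.0
    outage_start = None
    for timestamp, succeeded in probes:
        if not succeeded and outage_start is None:
            outage_start = timestamp  # first failure of a new outage
        elif succeeded and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None
    return longest


# Probes every 5 seconds during a drill; the region is blocked at t=10.
drill = [(0.0, True), (5.0, True), (10.0, False), (15.0, False), (20.0, False), (25.0, True), (30.0, True)]
print(longest_outage_seconds(drill))  # 15.0
```

The measurement is only as precise as the probe interval, so a drill aiming to verify a 60-second RTO needs probes every few seconds rather than once a minute.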