Multi-Region Architecture: Building Systems That Survive Regional Failures


AWS US-East-1 goes down. It has happened before, in 2011, 2012, 2017, and 2021. If your entire application runs in that one region, you are down too. Your users cannot access your service. Your revenue stops. Your SLA is violated.

Multi-region architecture distributes your system across multiple geographic regions. A regional failure affects only part of your system. The rest continues serving users.

Why multi-region

Disaster recovery: A regional failure (power outage, natural disaster, cloud provider incident) does not take down your entire service.

Lower latency: Serve users from the region closest to them. A user in Singapore might see single-digit-millisecond latency from a Singapore region instead of roughly 200ms from US-East.

Data residency: Some regulations (GDPR, data sovereignty laws) require data to be stored in specific geographic regions.

Compliance: Some industries require geographic redundancy for business continuity.

Multi-region patterns

Active-passive (primary-secondary)

One region is active (handles all traffic). Other regions are passive (standby, ready to take over). Traffic is routed to the active region. If the active region fails, traffic is switched to a passive region.

Failover time: Minutes to hours (depending on automation and data replication lag).

Data consistency: Strong (all writes go to the primary region, replicated to secondaries).

Cost: Lower (passive regions have minimal resources).

Use for: Applications where brief downtime is acceptable, data consistency is critical.
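
The failover decision in an active-passive setup can be sketched as a small state machine. This is a minimal illustration, not a production controller: the `FailoverMonitor` class, the region names, and the threshold of three consecutive failed health checks are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Tracks health checks for the active region and decides when to fail over."""
    active: str
    standby: str
    failure_threshold: int = 3   # assumed: consecutive failures before failover
    consecutive_failures: int = 0

    def record_check(self, healthy: bool) -> str:
        """Record one health-check result; return the region that should get traffic."""
        if healthy:
            self.consecutive_failures = 0
            return self.active
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            # Promote the standby; the failed region becomes the new standby.
            self.active, self.standby = self.standby, self.active
            self.consecutive_failures = 0
        return self.active


monitor = FailoverMonitor(active="us-east", standby="eu-west")
results = [monitor.record_check(h) for h in (True, False, False, False)]
print(results)  # traffic stays on us-east until the third consecutive failure
```

Requiring several consecutive failures avoids failing over on a single dropped health check, at the cost of a longer detection time.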

graph TB
subgraph active_passive["Active-Passive"]
  DNS1["DNS<br/>points to US-East"]
  US_E["US-East<br/>ACTIVE<br/>All traffic"]
  EU_W["EU-West<br/>PASSIVE<br/>Standby"]
  US_E -->|"async replication"| EU_W
  DNS1 --> US_E
  FAIL["US-East fails"] -->|"DNS failover<br/>5-15 minutes"| EU_W
end

subgraph active_active["Active-Active"]
  DNS2["GeoDNS<br/>routes by location"]
  US_E2["US-East<br/>ACTIVE<br/>US users"]
  EU_W2["EU-West<br/>ACTIVE<br/>EU users"]
  DNS2 -->|"US users"| US_E2
  DNS2 -->|"EU users"| EU_W2
  US_E2 <-->|"async replication"| EU_W2
end

style US_E fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style EU_W fill:#F1EFE8,stroke:#888780,color:#444441
style US_E2 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style EU_W2 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style FAIL fill:#FCEBEB,stroke:#A32D2D,color:#791F1F

Active-active

Multiple regions are active simultaneously. Traffic is distributed across regions (by geography, by load, or both). Each region handles a subset of traffic.

Failover time: Seconds to about a minute (bounded by the DNS TTL or the load balancer health-check interval).

Data consistency: Eventual (writes in one region must replicate to others).

Cost: Higher (all regions have full capacity).

Use for: Applications requiring high availability and low latency globally.
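
Geography-based routing with failover can be sketched as "pick the lowest-latency region that is currently healthy". The latency table, location codes, and `pick_region` function below are hypothetical; a real GeoDNS or global load balancer makes this decision for you.

```python
# Assumed round-trip estimates (ms) from a user location to each region.
LATENCY_MS = {
    "us": {"us-east": 20, "eu-west": 90},
    "eu": {"us-east": 90, "eu-west": 15},
}


def pick_region(user_location: str, healthy_regions: set[str]) -> str:
    """Return the lowest-latency healthy region for the given user location."""
    candidates = {
        region: ms
        for region, ms in LATENCY_MS[user_location].items()
        if region in healthy_regions
    }
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)


# Normal operation: EU users are served from EU-West.
normal = pick_region("eu", {"us-east", "eu-west"})
# EU-West is unhealthy: EU users fall back to US-East at higher latency.
degraded = pick_region("eu", {"us-east"})
```

Note that the surviving region must have enough spare capacity to absorb the redirected traffic, which is why active-active costs more.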

Read-local, write-global

A hybrid approach. Reads are served from the local region (low latency). Writes go to a primary region (strong consistency). Reads might be slightly stale (replication lag).

Use for: Read-heavy applications where brief read staleness is acceptable (social media feeds, product catalogs).
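
A minimal sketch of the read-local, write-global pattern, using in-memory dictionaries as stand-ins for regional databases. The class names and the explicit `replicate` step are assumptions for illustration; in a real system replication runs continuously in the background.

```python
class RegionalStore:
    """In-memory stand-in for one region's database."""

    def __init__(self, name: str):
        self.name = name
        self.data: dict[str, str] = {}


class ReadLocalWriteGlobal:
    """Routes writes to the primary region and reads to the local replica."""

    def __init__(self, primary: RegionalStore, local: RegionalStore):
        self.primary = primary
        self.local = local

    def write(self, key: str, value: str) -> None:
        self.primary.data[key] = value      # every write goes to the primary

    def read(self, key: str):
        return self.local.data.get(key)     # served locally, may be stale


def replicate(primary: RegionalStore, replica: RegionalStore) -> None:
    """Simulate asynchronous replication catching up."""
    replica.data.update(primary.data)


us_east = RegionalStore("us-east")          # primary
eu_west = RegionalStore("eu-west")          # local replica for EU users
router = ReadLocalWriteGlobal(primary=us_east, local=eu_west)

router.write("product:1", "Blue mug")
before = router.read("product:1")           # replica has not caught up yet
replicate(us_east, eu_west)
after = router.read("product:1")            # replica now has the write
```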

The data challenge

The hardest part of multi-region architecture is data. Compute is easy to replicate. Data is not.

Database replication across regions

Synchronous replication: Every write must be acknowledged by all regions before returning success. This gives strong consistency at the cost of high write latency, because each write waits for at least one cross-region round trip (cross-region RTT is 50-200ms).

Asynchronous replication: Writes are acknowledged locally, replicated to other regions in the background. Low write latency. Eventual consistency. Risk of data loss if the primary region fails before replication completes.

Semi-synchronous: At least one remote region must acknowledge the write before it returns success; the remaining regions catch up asynchronously. A balance between latency and durability.
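
The latency cost of each mode can be estimated with simple arithmetic. The sketch below assumes remote acknowledgements are requested in parallel, so a synchronous write waits for the slowest region and a semi-synchronous write waits for the fastest one. The function name and the example numbers are illustrative only.

```python
def write_latency_ms(mode: str, local_ms: float, remote_rtts_ms: list[float]) -> float:
    """Estimate client-visible write latency for a replication mode."""
    if mode == "sync":
        # Wait for every remote region: bounded by the slowest round trip.
        return local_ms + max(remote_rtts_ms)
    if mode == "semi-sync":
        # Wait for the first remote acknowledgement only.
        return local_ms + min(remote_rtts_ms)
    if mode == "async":
        # Acknowledge after the local commit; replicate in the background.
        return local_ms
    raise ValueError(f"unknown replication mode: {mode}")


# Assumed numbers: 5ms local commit, 80ms and 150ms RTT to two remote regions.
rtts = [80.0, 150.0]
sync_ms = write_latency_ms("sync", 5.0, rtts)        # 155.0
semi_ms = write_latency_ms("semi-sync", 5.0, rtts)   # 85.0
async_ms = write_latency_ms("async", 5.0, rtts)      # 5.0
```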

Conflict resolution

In active-active, two regions might accept conflicting writes simultaneously. User A updates their profile in US-East. User B (or the same user on a different device) updates it in EU-West. Both writes succeed. When they replicate, you have a conflict.

Resolution strategies:

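  • Last-write-wins: The write with the higher timestamp wins and the other is silently discarded. Simple, but it loses data and depends on reasonably synchronized clocks across regions.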
  • Application-level merge: Surface the conflict to the application. Complex but no data loss.
  • CRDTs: Data structures that merge automatically. Works for specific data types (counters, sets).
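
Two of these strategies are small enough to sketch. The `last_write_wins` function and the `GCounter` class below are illustrative, not taken from any particular database. The counter is a grow-only CRDT: each region increments only its own slot, and merging takes the per-region maximum, so merges can be applied in any order and repeated safely.

```python
def last_write_wins(a: dict, b: dict) -> dict:
    """Pick the write with the higher timestamp; break ties by region name."""
    return max(a, b, key=lambda w: (w["ts"], w["region"]))


class GCounter:
    """Grow-only counter CRDT with one slot per region."""

    def __init__(self, region: str):
        self.region = region
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        self.counts[self.region] = self.counts.get(self.region, 0) + n

    def merge(self, other: "GCounter") -> None:
        # Per-region maximum: commutative, associative, and idempotent.
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())


# Last-write-wins: the earlier profile update is discarded.
winner = last_write_wins(
    {"ts": 100, "region": "us-east", "value": "Alice A."},
    {"ts": 101, "region": "eu-west", "value": "Alice B."},
)

# G-Counter: concurrent increments in both regions are all preserved.
us, eu = GCounter("us-east"), GCounter("eu-west")
us.increment(3)
eu.increment(2)
us.merge(eu)
eu.merge(us)
```

After the two merges both replicas report a value of 5, whereas last-write-wins kept only one of the two profile updates.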

Global databases

Some databases handle multi-region replication natively:

  • Google Spanner: Globally distributed SQL with external consistency. Writes are globally consistent. High write latency (100-200ms for global transactions).
  • CockroachDB: Distributed SQL with multi-region support. Data locality is configurable per table (and per row).
  • DynamoDB Global Tables: Multi-region active-active with eventual consistency.
  • Cassandra: Multi-datacenter replication with tunable consistency.

graph LR
subgraph data_patterns["Data Replication Patterns"]
  SYNC["Synchronous<br/>Strong consistency<br/>High write latency<br/>100-200ms cross-region"]
  ASYNC["Asynchronous<br/>Eventual consistency<br/>Low write latency<br/>Risk of data loss"]
  SEMI["Semi-synchronous<br/>One remote ACK<br/>Balance of both"]
end

subgraph use2["Use For"]
  U1["Financial data<br/>Inventory<br/>User accounts"]
  U2["Activity feeds<br/>Product catalog<br/>Analytics"]
  U3["Most production<br/>use cases"]
end

SYNC --- U1
ASYNC --- U2
SEMI --- U3

style SYNC fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style ASYNC fill:#E1F5EE,stroke:#0F6E56,color:#085041
style SEMI fill:#FAEEDA,stroke:#854F0B,color:#633806

Where it breaks or gets interesting

DNS failover latency

DNS TTL determines how quickly traffic shifts during failover. A 5-minute TTL means it takes up to 5 minutes for all clients to see the new DNS record. During that window, some clients still go to the failed region.

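Use low TTLs (around 60 seconds) for critical services, keeping in mind that some resolvers and clients cache records longer than the TTL. Use anycast routing (Cloudflare, AWS Global Accelerator) for near-instant failover, since it does not depend on clients refreshing DNS.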
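
A rough worst-case estimate for DNS-based failover adds the time needed to detect the failure to the TTL. The function below is a back-of-the-envelope sketch. Its parameter names and example values are assumptions, and it ignores resolvers that cache beyond the TTL.

```python
def worst_case_failover_seconds(
    check_interval_s: float, failures_needed: int, ttl_s: float
) -> float:
    """Upper bound on how long some clients keep hitting the failed region."""
    detection_s = check_interval_s * failures_needed  # time to declare the region down
    return detection_s + ttl_s                        # plus time for cached records to expire


# Assumed health checks: every 10 seconds, 3 consecutive failures required.
with_5_min_ttl = worst_case_failover_seconds(10, 3, 300)  # 330 seconds
with_60_s_ttl = worst_case_failover_seconds(10, 3, 60)    # 90 seconds
```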

The split-brain problem

If the network between regions is severed, each region might conclude that the other is down. Both then act as the primary, and both accept writes. When the network heals, you have conflicting data.

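Prevention: use a consensus protocol for leader election so that only one region can be primary at a time (this needs a majority quorum, so in practice an odd number of voters, such as a third region or a witness), or design the system to handle conflicts (CRDTs, last-write-wins).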
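
The quorum rule behind leader election fits in one function: a region may act as primary only if it can reach a strict majority of voters, counting itself. This is a simplified sketch of the majority check, not a consensus protocol, and the three-voter layout (two regions plus a witness) is an assumed example.

```python
def may_act_as_primary(reachable_voters: int, total_voters: int) -> bool:
    """True if this region can reach a strict majority of voters (itself included)."""
    return reachable_voters > total_voters // 2


# Three voters: us-east, eu-west, and a witness in a third location.
# A partition isolates us-east; eu-west can still reach the witness.
us_east_ok = may_act_as_primary(reachable_voters=1, total_voters=3)  # False
eu_west_ok = may_act_as_primary(reachable_voters=2, total_voters=3)  # True

# Only two voters: after a partition neither side has a majority,
# so split-brain is avoided but no region can accept writes.
two_voter_ok = may_act_as_primary(reachable_voters=1, total_voters=2)  # False
```

At most one side of a partition can hold a majority, which is what rules out two primaries.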

Partial failures

A region might be partially degraded (some services down, not all). Traffic routing must be smart enough to detect partial failures and route around them. Use health checks at the service level, not just the region level.
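
Service-level routing can be sketched as a lookup over per-service health rather than a single flag per region. The health map, service names, and `regions_for_service` function below are hypothetical.

```python
def regions_for_service(health: dict[str, dict[str, bool]], service: str) -> list[str]:
    """Return the regions where the given service is currently healthy."""
    return [
        region
        for region, services in health.items()
        if services.get(service, False)   # unknown status is treated as unhealthy
    ]


# Assumed snapshot: checkout is down in US-East, everything else is healthy.
health = {
    "us-east": {"api": True, "checkout": False},
    "eu-west": {"api": True, "checkout": True},
}

api_regions = regions_for_service(health, "api")            # both regions
checkout_regions = regions_for_service(health, "checkout")  # EU-West only
```

A region-level check would mark US-East either fully up or fully down; the per-service view keeps its healthy services in rotation.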

Cross-region latency for synchronous operations

If your application makes synchronous cross-region calls (service in US-East calls database in EU-West), every request pays the cross-region latency penalty (100-200ms). Design to minimize cross-region synchronous calls. Use local caches, local read replicas, and async replication.

Real-world systems

Netflix - Active-active across multiple AWS regions. Uses Cassandra for multi-region data replication. Chaos Kong tests regional failover by simulating the loss of an entire region.

Amazon - Runs across multiple regions and availability zones. Services such as S3 and DynamoDB offer cross-region replication (S3 Cross-Region Replication, DynamoDB Global Tables).

Cloudflare - 300+ PoPs globally. Anycast routing sends users to the nearest PoP. Data is replicated globally.

Google - Spanner provides globally consistent transactions. Used for Google’s own services and available to customers.

Stripe - Active-passive with fast failover. Primary region handles all writes. Secondary region is a hot standby.

How to apply it in practice

Start with multi-AZ, then multi-region

Multi-AZ (multiple availability zones within one region) is simpler than multi-region and protects against most failures. Start there. Add multi-region when you need:

  • Protection against regional failures
  • Lower latency for global users
  • Data residency requirements

The multi-region checklist

Before going multi-region:

  1. Stateless application layer: Application servers must be stateless (sessions in Redis, files in S3)
  2. Database replication: Choose a replication strategy (sync, async, semi-sync)
  3. Conflict resolution: Define how to handle conflicting writes
  4. DNS and traffic routing: GeoDNS or anycast for routing, low TTLs for failover
  5. Monitoring: Region-level health checks and dashboards
  6. Runbooks: Documented procedures for regional failover
  7. Testing: Regular failover drills (chaos engineering)

Cost considerations

Running a second full region can double (or more) your infrastructure cost, and cross-region data transfer adds to the bill. Optimize:

  • Use active-passive for non-critical services (passive regions have minimal resources)
  • Use CDN for static content (cheaper than running full regions)
  • Use spot/preemptible instances for stateless compute
  • Replicate only the data that needs to be in each region

FAQ

Q: What is the difference between multi-region and multi-AZ?

Availability zones (AZs) are physically separate data centers within a region, connected by low-latency links. They protect against data center failures. Regions are geographically separate (different cities or countries). They protect against regional disasters (power grid failures, natural disasters, cloud provider regional incidents). Multi-AZ is simpler and cheaper. Multi-region is more complex but provides stronger guarantees.

Q: How do you handle user sessions in a multi-region active-active setup?

Store sessions in a distributed cache (Redis with multi-region replication, or DynamoDB Global Tables). When a user’s request is routed to a different region (due to failover or load balancing), the new region can read the session from the distributed cache. Alternatively, use JWTs (stateless tokens) that do not require a session store. The JWT contains the user’s identity and is verified locally by any region.
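
The stateless-token idea can be illustrated with the standard library alone. The sketch below is a simplified HMAC-signed token, not a standards-compliant JWT: it has no header, expiry, or key rotation, and the shared `SECRET` is a placeholder. A production system would use a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"demo-secret-shared-by-all-regions"  # placeholder for illustration only


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _unb64(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(payload: str) -> str:
    return _b64(hmac.new(SECRET, payload.encode(), hashlib.sha256).digest())


def issue_token(claims: dict) -> str:
    """Encode the claims and append an HMAC-SHA256 signature."""
    payload = _b64(json.dumps(claims, sort_keys=True).encode())
    return f"{payload}.{_sign(payload)}"


def verify_token(token: str):
    """Return the claims if the signature is valid, otherwise None."""
    try:
        payload, signature = token.split(".")
    except ValueError:
        return None
    if not hmac.compare_digest(signature, _sign(payload)):
        return None
    return json.loads(_unb64(payload))


token = issue_token({"sub": "user-42"})       # issued in one region
claims = verify_token(token)                  # verified locally in any region
tampered = verify_token(token[:-1] + ("A" if token[-1] != "A" else "B"))
```

Any region holding the key can verify the token without a cross-region lookup, which is the property that makes this approach attractive for active-active setups.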

Q: What is RTO and RPO?

RTO (Recovery Time Objective) is the maximum acceptable time to restore service after a failure. RPO (Recovery Point Objective) is the maximum acceptable data loss (measured in time). For example: RTO = 1 hour (service must be restored within 1 hour), RPO = 5 minutes (at most 5 minutes of data can be lost). These objectives drive your architecture choices: lower RTO requires faster failover (active-active), lower RPO requires more frequent replication (synchronous or near-synchronous).
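
One way to make these objectives concrete is to compare them against numbers measured in a failover drill. The `RecoveryObjectives` class and the measured values below are assumptions for illustration; replication lag is used as a proxy for the data that would be lost if the primary failed at that moment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryObjectives:
    rto_seconds: float   # maximum acceptable time to restore service
    rpo_seconds: float   # maximum acceptable data loss, measured in time


def meets_objectives(
    objectives: RecoveryObjectives,
    measured_failover_s: float,
    replication_lag_s: float,
) -> bool:
    """True if the measured failover time and replication lag are within RTO and RPO."""
    return (
        measured_failover_s <= objectives.rto_seconds
        and replication_lag_s <= objectives.rpo_seconds
    )


# RTO = 1 hour, RPO = 5 minutes, as in the example above.
objectives = RecoveryObjectives(rto_seconds=3600, rpo_seconds=300)
within = meets_objectives(objectives, measured_failover_s=900, replication_lag_s=45)
lagging = meets_objectives(objectives, measured_failover_s=900, replication_lag_s=600)
```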

Interview questions

Q1: Design a multi-region architecture for a payment processing system. What are the key considerations?

Strong answer: Payments require strong consistency (no double charges, no lost transactions). Use active-passive with synchronous replication for the database. The primary region handles all writes. The secondary region is a hot standby with synchronous replication (every write is acknowledged by both regions before returning success). This adds 100-200ms to write latency but guarantees no data loss. For reads, use the local region (low latency). For failover, use a consensus service (such as etcd, with members in at least three locations so it keeps quorum when one region is lost) to manage primary election. Failover is automated and takes 30-60 seconds. Use a global load balancer (AWS Global Accelerator) for near-instant traffic routing. RPO: 0 (synchronous replication, no data loss). RTO: 60 seconds (automated failover). The tradeoff: higher write latency (100-200ms) for strong consistency.

Q2: Your application is active-active across US-East and EU-West. A user updates their profile in US-East. 500ms later, they read their profile from EU-West (their request was routed there due to load balancing). They see the old profile. How do you fix this?

Strong answer: This is the read-your-writes consistency problem in a multi-region setup. Solutions: sticky routing (route all requests from a user to the same region for a short window after a write), replication position tokens (after a write, return the replication position; on subsequent reads, pass the token; the EU-West region checks if it has caught up to that position before serving the read), or synchronous replication for user profile writes (accept higher write latency for this specific data). The simplest solution for most applications: sticky routing with a 5-second window. After a write, set a cookie that routes the user to the same region for 5 seconds. After that, normal routing resumes. This handles the common case (user immediately sees their own update) without requiring synchronous replication.
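
The replication position token approach can be sketched with a log position that the primary returns on every write. The `Primary` and `Replica` classes are in-memory stand-ins, and falling back to the primary when the replica is behind is one assumed policy; waiting for the replica to catch up is another.

```python
class Primary:
    """Primary region: assigns a monotonically increasing position to each write."""

    def __init__(self):
        self.position = 0
        self.data = {}
        self.log = []

    def write(self, key, value) -> int:
        self.position += 1
        self.data[key] = value
        self.log.append((self.position, key, value))
        return self.position          # the token handed back to the client


class Replica:
    """Remote region: applies log entries in order and tracks how far it has got."""

    def __init__(self):
        self.applied_position = 0
        self.data = {}

    def apply(self, position, key, value) -> None:
        self.data[key] = value
        self.applied_position = position


def read_your_writes(replica: Replica, primary: Primary, key, token: int):
    """Serve from the replica only if it has caught up to the client's token."""
    if replica.applied_position >= token:
        return replica.data.get(key), "replica"
    return primary.data.get(key), "primary"   # assumed fallback policy


us_east, eu_west = Primary(), Replica()
token = us_east.write("profile:7", "new bio")

stale_case = read_your_writes(eu_west, us_east, "profile:7", token)
eu_west.apply(*us_east.log[0])                # replication catches up
fresh_case = read_your_writes(eu_west, us_east, "profile:7", token)
```

In both cases the user sees their own write; the token only decides whether the read pays the cross-region round trip.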

Q3: How do you test multi-region failover without causing a real outage?

Strong answer: Use chaos engineering. Netflix’s Chaos Kong simulates the failure of an entire AWS region in production to verify that failover works. For most teams, start smaller: test failover in a staging environment that mirrors production. Simulate a regional failure by blocking all traffic to one region (using firewall rules or DNS changes). Verify that traffic automatically routes to the other region. Measure the failover time (RTO). Check that no data was lost (RPO). Run these tests regularly (monthly or quarterly). Gradually increase the scope: start with a single service, then a full region. Document the runbook for manual failover in case automation fails. The goal is to make failover a routine, practiced operation, not a panicked response to a real incident.