Availability and Fault Tolerance: Building Systems That Stay Up
It is 3am on a Tuesday. Your on-call phone goes off. The payment service is down. Not slow - down. Every request returning 503. You check the dashboard: one of your three database nodes crashed 4 minutes ago. The other two are healthy. But your application is configured to require all three nodes to agree before committing a write. So now nothing works.
You had redundancy. You had three nodes instead of one. But you did not have fault tolerance - the ability to keep operating when components fail. Redundancy is a prerequisite for fault tolerance, but it is not sufficient. How you handle failures is what determines whether your system stays up.
What availability actually means
Availability is the percentage of time a system is operational and serving requests correctly. It is usually expressed as “nines”:
- 99% - 3.65 days downtime per year
- 99.9% - 8.76 hours downtime per year
- 99.99% - 52.6 minutes downtime per year
- 99.999% - 5.26 minutes downtime per year
The jump from 99.9% to 99.99% is not “one more nine” - it is reducing downtime by 10x. Each additional nine is roughly 10x harder and more expensive to achieve.
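The downtime figures above follow directly from the number of minutes in a year. A minimal sketch of the arithmetic, assuming a 365-day year (the function name is illustrative, not from any library):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Downtime budget per year, in minutes, for an availability percentage."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% allows {minutes:.2f} minutes of downtime per year")
```

Each extra nine divides the budget by ten, which is why the last two targets leave less than an hour and less than six minutes respectively.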
Availability is calculated as: Availability = MTBF / (MTBF + MTTR)
Where MTBF is Mean Time Between Failures (here, the average uptime between one recovery and the next failure) and MTTR is Mean Time To Recovery. You can improve availability by making failures less frequent (raising MTBF) or by recovering faster (lowering MTTR). Most teams underinvest in MTTR.
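As a quick illustration of the formula, here is a small sketch. The input values (one failure every 30 days, one hour to recover) are assumptions chosen for the example, not measurements:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), with both values in the same unit."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Assumed example: a failure every 30 days on average, one hour to recover.
mtbf = 30 * 24
mttr = 1
print(f"Availability: {availability(mtbf, mttr) * 100:.3f}%")
```

With these assumed numbers the service lands just short of three nines, even though it fails only about once a month.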
Fault tolerance vs high availability
These terms are related but distinct:
High availability (HA) - The system minimizes downtime through redundancy and fast failover. There may be brief interruptions during failover, but they are short.
Fault tolerance - The system continues operating correctly even when components fail, with no interruption to users. True fault tolerance means users never see an error even during a failure.
Fault tolerance is harder and more expensive. Most systems aim for high availability (fast recovery) rather than true fault tolerance (no interruption).
graph TB
  subgraph ha["High Availability"]
    HA1["Primary fails"]
    HA2["Brief interruption<br/>10-30 seconds"]
    HA3["Failover to replica"]
    HA4["Service restored"]
    HA1 --> HA2 --> HA3 --> HA4
  end
  subgraph ft["Fault Tolerant"]
    FT1["Primary fails"]
    FT2["Automatic reroute<br/>No interruption"]
    FT3["Users unaffected"]
    FT1 --> FT2 --> FT3
  end
  style HA1 fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
  style HA2 fill:#FAEEDA,stroke:#854F0B,color:#633806
  style HA3 fill:#FAEEDA,stroke:#854F0B,color:#633806
  style HA4 fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style FT1 fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
  style FT2 fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style FT3 fill:#E1F5EE,stroke:#0F6E56,color:#085041
How fault tolerance is engineered
Redundancy patterns
Active-passive (primary-standby) - One node handles all traffic. The standby is ready to take over. Failover requires detecting the failure and promoting the standby. Downtime during failover: seconds to minutes.
Active-active - Multiple nodes handle traffic simultaneously. If one fails, the others absorb its load. No failover needed - traffic is already distributed. Requires the application to handle concurrent writes correctly.
N+1 redundancy - You have N nodes needed to handle load, plus 1 spare. If one fails, you are at capacity but still operational. Common for stateless services.
N+2 redundancy - Two spares. Can tolerate two simultaneous failures. Used for critical infrastructure.
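N+1 and N+2 sizing comes down to a capacity check. The sketch below assumes every node has the same capacity and that load spreads evenly; the function names and the example numbers are hypothetical:

```python
import math


def nodes_required(peak_load_rps: float, per_node_rps: float, spares: int = 1) -> int:
    """N+k sizing: N nodes to carry peak load, plus `spares` extra nodes."""
    n = math.ceil(peak_load_rps / per_node_rps)
    return n + spares


def survives(total_nodes: int, failed: int, peak_load_rps: float, per_node_rps: float) -> bool:
    """True if the surviving nodes can still carry the peak load."""
    return (total_nodes - failed) * per_node_rps >= peak_load_rps


# Assumed example: 900 requests/second at peak, 250 requests/second per node.
n_plus_1 = nodes_required(900, 250, spares=1)
print(n_plus_1, survives(n_plus_1, 1, 900, 250), survives(n_plus_1, 2, 900, 250))
```

In this example N+1 absorbs one failure but not two, which is the gap N+2 closes.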
Failure detection
You cannot fail over to a healthy node if you do not know a node is unhealthy. Detection mechanisms:
Health checks - Periodic HTTP requests to /health endpoints. If a node fails to respond within a timeout, it is marked unhealthy. Load balancers use this to stop routing traffic to failed nodes.
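Heartbeats - Nodes send periodic “I am alive” signals to a coordinator. If the coordinator stops receiving heartbeats, it marks the node as failed. Used in distributed systems such as Kafka, where brokers heartbeat to the controller. Cassandra uses a peer-to-peer variant: nodes gossip heartbeat state to each other rather than to a single coordinator.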
Timeouts - If a request to a dependency takes longer than N milliseconds, treat it as a failure. The timeout value is a critical tuning parameter - too short causes false positives, too long causes cascading failures.
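A heartbeat-based detector can be sketched in a few lines. This is an illustrative, single-process model rather than any particular system's implementation; the class name and the injectable `clock` parameter are assumptions made so the behavior is easy to test:

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Marks a node as failed when no heartbeat arrives within the timeout."""

    def __init__(self, timeout_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.last_seen: Dict[str, float] = {}

    def record_heartbeat(self, node_id: str) -> None:
        """Called whenever an "I am alive" signal arrives from a node."""
        self.last_seen[node_id] = self.clock()

    def failed_nodes(self) -> List[str]:
        """Nodes whose most recent heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(node for node, seen in self.last_seen.items()
                      if now - seen > self.timeout_seconds)


# Usage with a fake clock so the example runs instantly.
fake_now = [0.0]
monitor = HeartbeatMonitor(timeout_seconds=5.0, clock=lambda: fake_now[0])
monitor.record_heartbeat("node-a")
monitor.record_heartbeat("node-b")
fake_now[0] = 4.0
monitor.record_heartbeat("node-a")   # node-b stays silent
fake_now[0] = 6.0
print(monitor.failed_nodes())
```

The timeout carries the same tradeoff described above: a shorter value detects failures sooner but flags slow nodes as dead.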
Graceful degradation
When a component fails, the system should degrade gracefully rather than fail completely. This means:
- Return cached data instead of live data
- Disable non-critical features (recommendations, analytics) while keeping core features (checkout, login) working
- Return partial results rather than errors
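Netflix’s Hystrix popularized the circuit breaker as a way to implement this. Hystrix is now in maintenance mode, and Resilience4j is a commonly used successor in the Java ecosystem. When a downstream service is slow or failing, the circuit breaker opens and returns a fallback response immediately, preventing the failure from cascading.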
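To make the circuit breaker idea concrete, here is a deliberately small sketch. It is not the Hystrix or Resilience4j API; the class, its parameters, and the thresholds are hypothetical, and it is not thread-safe:

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Opens after N consecutive failures, then fails fast until a cool-down passes."""

    def __init__(self, failure_threshold: int = 3, reset_after_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, operation: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_seconds:
                return fallback()  # open: skip the dependency entirely
            # Cool-down elapsed (half-open): let one trial call through below.
        try:
            result = operation()
        except Exception:
            self.consecutive_failures += 1
            if self.opened_at is not None or self.consecutive_failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()
        self.consecutive_failures = 0
        self.opened_at = None  # success closes the circuit
        return result


def fetch_recommendations() -> list:
    raise TimeoutError("recommendation service is down")  # simulated outage


breaker = CircuitBreaker(failure_threshold=2)
for _ in range(3):
    print(breaker.call(fetch_recommendations, fallback=lambda: []))
```

The third call never reaches the failing service: the breaker is already open, so the page renders with an empty recommendations list instead of waiting on a timeout.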
graph LR
  subgraph normal["Normal Operation"]
    C1["Client"] --> LB1["Load Balancer"]
    LB1 --> S1["Server 1"]
    LB1 --> S2["Server 2"]
    LB1 --> S3["Server 3"]
  end
  subgraph failure["Server 2 Fails"]
    C2["Client"] --> LB2["Load Balancer"]
    LB2 --> S4["Server 1"]
    LB2 -.->|"health check fails"| S5["Server 2 DOWN"]
    LB2 --> S6["Server 3"]
  end
  style S5 fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
  style LB1 fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style LB2 fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style S1 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
  style S2 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
  style S3 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
  style S4 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
  style S6 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
Where it breaks or gets interesting
The availability math for dependent services
If your service depends on three other services, each with 99.9% availability, your availability is not 99.9%. If every request needs all three and they fail independently, it is at best 0.999 x 0.999 x 0.999 = 99.7%. Availabilities of serial dependencies multiply, so their failure probabilities roughly add up.
This is why microservices architectures require careful thought about availability. A service with 10 such dependencies, each at 99.9%, has a theoretical availability of 0.999^10 = 99.0% - a full nine worse than a single monolith running at 99.9%.
The mitigation: make dependencies optional where possible. If the recommendation service is down, show no recommendations rather than failing the entire page.
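The effect of making dependencies optional can be checked numerically. This sketch assumes dependencies fail independently and that every request touches each critical dependency; the split into three critical and seven optional dependencies is an assumed example:

```python
import math
from typing import Iterable


def serial_availability(critical_dependencies: Iterable[float]) -> float:
    """Best-case availability when every listed dependency must be up."""
    return math.prod(critical_dependencies)


all_critical = serial_availability([0.999] * 10)
# Assumed example: 7 of the 10 dependencies get fallbacks and drop off the critical path.
three_critical = serial_availability([0.999] * 3)

print(f"10 critical dependencies: {all_critical * 100:.2f}%")
print(f" 3 critical dependencies: {three_critical * 100:.2f}%")
```

Optional dependencies still fail just as often; they simply stop counting against the page's availability once a fallback exists.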
Correlated failures are the real danger
Redundancy protects against independent failures. It does not protect against correlated failures - where multiple components fail for the same reason simultaneously.
Common correlated failure causes:
- Shared infrastructure - All nodes in the same availability zone. An AZ outage takes them all down.
- Shared configuration - A bad config deploy hits all nodes simultaneously.
- Thundering herd - All nodes restart at the same time after a deploy, overwhelming the database.
- Cascading failures - One slow service causes timeouts everywhere, which causes retries, which increases load, which makes everything slower.
The fix: spread redundant components across failure domains (AZs, regions). Use staggered deploys. Add jitter to retry logic.
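Jittered retries are the easiest of these fixes to show in code. The sketch below uses exponential backoff with "full jitter" (a random delay between zero and the exponential cap). The function names and default values are assumptions, and `sleep` and `rng` are injectable only to keep the example testable:

```python
import random
import time


def backoff_delay(attempt: int, base_seconds: float = 0.1,
                  cap_seconds: float = 10.0, rng=random) -> float:
    """Full jitter: a uniform delay in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0, min(cap_seconds, base_seconds * 2 ** attempt))


def call_with_retries(operation, max_attempts: int = 5, sleep=time.sleep, rng=random):
    """Retry a failing operation, sleeping a jittered delay between attempts."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt, rng=rng))


# Usage: an operation that fails twice, then succeeds.
attempts = []

def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("temporary failure")
    return "ok"

print(call_with_retries(flaky, sleep=lambda seconds: None))
```

Without the random component, clients that failed together retry together, which recreates the load spike on every retry round.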
The split-brain problem
In a cluster that elects a single primary, if the network between two nodes is severed, each node might conclude the other is dead and promote itself. Now you have two primaries accepting writes - a split-brain scenario. When the network heals, you have conflicting data.
Solution: require a quorum (majority) of nodes to agree before accepting writes. With 3 nodes, you need 2 to agree, counting the node itself. If a node cannot reach at least one other node, it stops accepting writes. This is the CP choice in the CAP theorem.
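The quorum rule is small enough to write down directly. In this sketch a node counts itself plus the peers it can currently reach; the function names are illustrative:

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest majority of the cluster."""
    return cluster_size // 2 + 1


def may_accept_writes(cluster_size: int, reachable_peers: int) -> bool:
    """A node may accept writes only if it, plus the peers it can reach, form a majority."""
    return reachable_peers + 1 >= quorum_size(cluster_size)


# A 3-node cluster partitioned into a group of 2 and a group of 1.
print(may_accept_writes(3, reachable_peers=1))  # node in the 2-node group
print(may_accept_writes(3, reachable_peers=0))  # isolated node
```

Because two disjoint groups cannot both hold a majority, at most one side of a partition keeps accepting writes.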
MTTR is often more impactful than MTBF
Most teams focus on preventing failures (MTBF). But for a given availability target, reducing MTTR from 60 minutes to 6 minutes has the same effect as making failures 10x less frequent. Investments in MTTR: automated failover, runbooks, on-call tooling, observability, chaos engineering to practice recovery.
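The equivalence between the two investments falls out of the availability formula. A small check, using assumed baseline values (a failure every 30 days, 60 minutes to recover):

```python
import math


def unavailability(mtbf_minutes: float, mttr_minutes: float) -> float:
    """Fraction of time spent down: MTTR / (MTBF + MTTR)."""
    return mttr_minutes / (mtbf_minutes + mttr_minutes)


MTBF = 30 * 24 * 60  # assumed: one failure every 30 days, in minutes
MTTR = 60            # assumed: one hour to recover

baseline = unavailability(MTBF, MTTR)
faster_recovery = unavailability(MTBF, MTTR / 10)   # recover in 6 minutes instead of 60
rarer_failures = unavailability(MTBF * 10, MTTR)    # failures 10x less frequent

print(f"baseline:        {baseline:.6f}")
print(f"faster recovery: {faster_recovery:.6f}")
print(f"rarer failures:  {rarer_failures:.6f}")
print(math.isclose(faster_recovery, rarer_failures))
```

Both changes cut downtime by the same amount; the difference is usually cost, since automating failover tends to be cheaper than eliminating nine out of ten failures.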
Real-world systems and their approaches
AWS - Each Availability Zone consists of one or more physically separate data centers within a region, connected to the region's other AZs by low-latency links. Deploying across 3 AZs protects against single AZ failures. Regions are geographically separate and protect against regional disasters.
Google Spanner - Uses Paxos consensus across multiple zones. Can tolerate zone failures with no data loss and minimal availability impact. Multi-region configurations carry a 99.999% availability SLA; regional configurations carry 99.99%.
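Cassandra - A common setup is a replication factor of 3 with each AZ mapped to a rack, so the three replicas land in different AZs. Both QUORUM and LOCAL_QUORUM then need 2 of 3 replicas, so reads and writes keep working through a single AZ failure. The two levels differ only in multi-datacenter clusters: LOCAL_QUORUM counts replicas in the local datacenter, while QUORUM counts replicas across all datacenters and so adds cross-datacenter latency.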
Netflix - Runs Chaos Monkey in production, randomly terminating instances to ensure the system handles failures gracefully. Simian Army extends this to AZ failures (Chaos Gorilla) and region failures (Chaos Kong).
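PostgreSQL with Patroni - Patroni uses a distributed configuration store such as etcd, Consul, or ZooKeeper for leader election. If the primary fails, Patroni automatically promotes a replica. Redirecting clients is handled by a separate routing layer, for example HAProxy health-checking Patroni's REST API, or a virtual IP. Failover typically takes 10-30 seconds, depending on the configured timeouts.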
How to apply it in practice
Availability targets by component type
Not everything needs five nines. Match the availability target to the business impact:
- Core transaction path (checkout, login, payment): 99.99%+
- User-facing features (search, recommendations): 99.9%
- Internal tools (admin dashboards, analytics): 99.5%
- Batch jobs: measured in successful completion rate, not uptime
The redundancy checklist
For each component in your system:
- Single points of failure? Is this a component whose failure alone takes down the system?
- Failure domain isolation? Are redundant components in separate AZs/regions?
- Automated failover? Does recovery require human intervention?
- Health check coverage? Is every component monitored with appropriate timeouts?
- Graceful degradation? What does the system do when this component is unavailable?
Chaos engineering
The only way to know your fault tolerance actually works is to test it. Chaos engineering means deliberately injecting failures in production (or a production-like environment) to verify the system handles them correctly.
Start small: kill a single instance and verify the load balancer routes around it. Then kill an AZ. Then simulate a slow dependency (latency injection). Each test reveals assumptions you did not know you were making.
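The first experiment, killing one instance and checking that traffic routes around it, can be rehearsed against a toy model before touching real infrastructure. Everything in this sketch is a simplified stand-in: `Server.alive` plays the role of a health check result, and the balancer is a plain round-robin loop.

```python
from typing import List


class Server:
    """Toy stand-in for an instance; `alive` is what a /health probe would report."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True


class RoundRobinBalancer:
    """Round-robin routing that skips servers failing their health check."""

    def __init__(self, servers: List[Server]) -> None:
        self.servers = servers
        self._next = 0

    def route(self) -> str:
        for _ in range(len(self.servers)):
            server = self.servers[self._next % len(self.servers)]
            self._next += 1
            if server.alive:
                return server.name
        raise RuntimeError("no healthy servers left")


def kill_one_instance_experiment(balancer: RoundRobinBalancer, victim: Server,
                                 requests: int = 30) -> dict:
    """Inject a failure, send traffic, and report what the balancer did."""
    victim.alive = False
    served_by = [balancer.route() for _ in range(requests)]
    return {
        "all_requests_served": len(served_by) == requests,
        "victim_received_traffic": victim.name in served_by,
        "servers_used": sorted(set(served_by)),
    }


servers = [Server("server-1"), Server("server-2"), Server("server-3")]
report = kill_one_instance_experiment(RoundRobinBalancer(servers), servers[1])
print(report)
```

A real experiment would state the same hypothesis up front (all requests served, none routed to the victim) and then check it against production metrics rather than an in-memory list.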
FAQ
Q: What is the difference between availability and reliability?
Availability is the percentage of time the system is operational. Reliability is the probability that the system performs its intended function without failure over a given time period. A system can be highly available but unreliable - it fails often but recovers within seconds, so total downtime stays small even though users keep hitting interruptions. A reliable system runs for long stretches without failing at all. For most practical purposes, you want both: the system is rarely down in total (available) and rarely fails in the first place (reliable).
Q: How do you achieve 99.999% availability?
Five nines means 5.26 minutes of downtime per year. Achieving this requires: multi-region active-active deployment (so a full region failure does not cause downtime), automated failover with no human in the loop, zero-downtime deploys, comprehensive health checking, and extensive chaos testing. Your hard dependencies (DNS, CDN, cloud provider) must meet this bar too. In practice, very few systems genuinely need five nines - the cost is enormous and most businesses can tolerate more downtime than they think.
Q: Should I use active-active or active-passive?
Active-active is better for availability (no failover time) and scales horizontally. But it requires your application to handle concurrent writes correctly, which is hard for stateful systems. Active-passive is simpler and safer for databases and stateful services. Use active-active for stateless services (web servers, API servers) and active-passive for stateful services (databases, caches) unless you have a specific reason to do otherwise.
Interview questions
Q1: Your service has three dependencies, each with 99.9% availability. What is your service’s availability and how do you improve it?
Strong answer: Theoretical availability is 0.999^3 = 99.7%, which is about 26 hours of downtime per year. To improve it: first, identify which dependencies are on the critical path vs optional. For optional dependencies (recommendations, analytics), implement fallbacks so their failure does not affect your service’s availability. For critical dependencies, work with those teams to improve their availability, or add caching layers that can serve stale data during outages. Also consider whether you can make synchronous dependencies asynchronous - if you can queue a request and process it later, a downstream failure becomes a delay rather than an error.
Q2: You are designing a payment processing system that needs 99.99% availability. Walk through your architecture.
Strong answer: 99.99% means 52 minutes of downtime per year. Key decisions: deploy across at least 2 AZs (ideally 3) with active-active for the stateless API layer. For the database, use a managed service like Aurora with multi-AZ replication and automatic failover, or CockroachDB for multi-region. Implement circuit breakers for all downstream dependencies (fraud detection, notification service) so their failures do not cascade. Use a message queue for operations that can be async (sending receipts) so they do not block the critical path. Implement idempotency keys so retries do not cause double charges. Test failover quarterly with chaos engineering. Monitor MTTR, not just uptime.
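The idempotency-key point in this answer is worth sketching, since it is what makes retries safe on a payment path. This is an in-memory illustration with hypothetical names; a real system would keep the key-to-result mapping in a durable store and make the check-and-record step atomic:

```python
from typing import Dict


class PaymentProcessor:
    """Remembers the result for each idempotency key so retries do not charge twice."""

    def __init__(self) -> None:
        self._results: Dict[str, dict] = {}
        self.charges_made = 0

    def charge(self, idempotency_key: str, account: str, amount_cents: int) -> dict:
        # A retry with a known key returns the stored result without charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        result = {
            "charge_id": f"ch_{self.charges_made}",
            "account": account,
            "amount_cents": amount_cents,
        }
        self._results[idempotency_key] = result
        return result


processor = PaymentProcessor()
first = processor.charge("order-1001", "acct-42", 2500)
retry = processor.charge("order-1001", "acct-42", 2500)  # client retried after a timeout
print(first == retry, processor.charges_made)
```

The client generates the key once per logical payment and reuses it on every retry, so a timeout followed by a retry results in a single charge.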
Q3: Explain the split-brain problem and how quorum-based systems solve it.
Strong answer: Split-brain occurs when a network partition causes two nodes to both believe they are the primary and both accept writes. When the partition heals, you have conflicting data with no clear winner. Quorum-based systems solve this by requiring a majority of nodes to agree before accepting writes. With 3 nodes, you need 2 (a quorum), counting the node itself. If a node cannot reach at least one other node, it refuses writes. This means during a partition, at most one side (the one with the majority) can accept writes. The minority side becomes read-only or unavailable. The tradeoff: you sacrifice availability (the minority side cannot accept writes) to prevent split-brain. This is the CP choice in the CAP theorem. etcd (Raft), ZooKeeper (Zab), and other consensus-based systems all use this approach.