Distributed Consensus: Getting Nodes to Agree on Anything

Three database nodes. A network partition splits them: node 1 is isolated, nodes 2 and 3 can still talk to each other. Node 1 thinks it is still the leader. Nodes 2 and 3 elect a new leader. Now you have two leaders. Both accept writes. When the partition heals, you have conflicting data with no clear winner.

This is the split-brain problem. Distributed consensus algorithms prevent it by ensuring that only one leader can exist at a time, and that any decision requires agreement from a majority of nodes.

What consensus means

Consensus is the problem of getting multiple nodes to agree on a single value, even when some nodes fail or messages are delayed. A consensus algorithm must satisfy:

Agreement: All non-faulty nodes decide on the same value.

Validity: The decided value was proposed by some node (not invented out of thin air).

Termination: All non-faulty nodes eventually decide.

Fault tolerance: The algorithm works correctly even if some nodes fail (crash-stop failures) or messages are delayed.

The FLP impossibility result (Fischer, Lynch, Paterson, 1985) proved that no deterministic consensus algorithm can guarantee termination in an asynchronous system with even one faulty node. In practice, algorithms work around this by using timeouts and randomization.

Paxos: the original consensus algorithm

Paxos (Lamport, 1989) was the first practical consensus algorithm. It is notoriously difficult to understand and implement correctly. Most production systems use Raft instead, but understanding Paxos helps understand why Raft was designed the way it was.

Paxos has two phases:

Phase 1 (Prepare):

A proposer picks a proposal number N (higher than any it has seen)
Proposer sends Prepare(N) to a majority of acceptors
Each acceptor responds with the highest-numbered proposal it has already accepted (if any)

Phase 2 (Accept):

If the proposer receives responses from a majority, it sends Accept(N, value) to acceptors
The value is either the highest-valued proposal from phase 1 responses, or the proposer’s own value if no previous proposals exist
If a majority of acceptors accept, the value is chosen

The key insight: any two majorities overlap. If a value is chosen by one majority, any future majority will include at least one node that knows about it.

Raft: consensus made understandable

Raft (Ongaro and Ousterhout, 2014) was designed to be more understandable than Paxos. It decomposes consensus into three sub-problems: leader election, log replication, and safety.

Leader election

Raft uses a leader-based approach. One node is the leader. All writes go through the leader. The leader replicates entries to followers.

Terms: Time is divided into terms. Each term begins with an election. If a leader is elected, it serves for the rest of the term. If no leader is elected (split vote), a new term begins.

Election process:

A follower starts an election when it does not hear from the leader within the election timeout (150-300ms)
It increments its term, becomes a candidate, votes for itself
It sends RequestVote to all other nodes
Nodes grant their vote if: the candidate’s term is at least as large as theirs, and the candidate’s log is at least as up-to-date as theirs
If the candidate receives votes from a majority, it becomes the leader
The new leader sends heartbeats to prevent new elections

graph TB
subgraph raft["Raft Leader Election"]
  F1["Follower 1
Election timeout"]
  F1 -->|"becomes candidate
term=2"| C1["Candidate 1"]
  C1 -->|"RequestVote(term=2)"| F2["Follower 2"]
  C1 -->|"RequestVote(term=2)"| F3["Follower 3"]
  F2 -->|"vote granted"| C1
  F3 -->|"vote granted"| C1
  C1 -->|"majority votes
becomes leader"| L1["Leader
term=2"]
  L1 -->|"heartbeats"| F2
  L1 -->|"heartbeats"| F3
end

style C1 fill:#FAEEDA,stroke:#854F0B,color:#633806
style L1 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style F2 fill:#E1F5EE,stroke:#0F6E56,color:#085041
style F3 fill:#E1F5EE,stroke:#0F6E56,color:#085041

Log replication

The leader receives client requests, appends them to its log, and replicates to followers.

Client sends a command to the leader
Leader appends the command to its log (uncommitted)
Leader sends AppendEntries to all followers
Followers append the entry to their logs and respond
Once a majority have responded, the leader commits the entry
Leader applies the command to its state machine and responds to the client
Leader notifies followers of the commit in the next heartbeat

Safety guarantee: An entry is committed only when a majority of nodes have it in their logs. A new leader will always have all committed entries (because it must receive votes from a majority, and any majority overlaps with the commit majority).

Why Raft prevents split-brain

A node can only become leader if it receives votes from a majority. In a 5-node cluster, a majority is 3. If the cluster is partitioned into groups of 2 and 3, only the group of 3 can elect a leader. The group of 2 cannot reach a majority and cannot elect a leader. This prevents two leaders from existing simultaneously.

graph LR
subgraph partition["Network Partition - 5 Node Cluster"]
  subgraph majority["Majority partition (3 nodes)"]
    N1["Node 1
Leader"]
    N2["Node 2
Follower"]
    N3["Node 3
Follower"]
  end
  subgraph minority["Minority partition (2 nodes)"]
    N4["Node 4
Cannot elect leader
(no majority)"]
    N5["Node 5
Cannot elect leader
(no majority)"]
  end
end

style N1 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style N2 fill:#E1F5EE,stroke:#0F6E56,color:#085041
style N3 fill:#E1F5EE,stroke:#0F6E56,color:#085041
style N4 fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
style N5 fill:#FCEBEB,stroke:#A32D2D,color:#791F1F

Where it breaks or gets interesting

The availability-consistency tradeoff

Raft requires a majority to make progress. In a 5-node cluster, you need 3 nodes. If 3 nodes fail, the cluster stops accepting writes. This is the CP choice in CAP theorem: consistency over availability.

For a 3-node cluster: can tolerate 1 failure. For a 5-node cluster: can tolerate 2 failures. For a 7-node cluster: can tolerate 3 failures. More nodes = more fault tolerance, but also more overhead (larger majorities needed).

Stale reads from followers

Followers might be slightly behind the leader. Reading from a follower might return stale data. For linearizable reads, read from the leader (or use a read index: the leader confirms it is still the leader before serving the read).

Leader lease reads

Confirming leadership on every read adds latency. Leader leases allow the leader to serve reads without confirmation for a bounded time period (the lease duration). If the leader’s lease has not expired, it knows it is still the leader. This reduces read latency at the cost of a small window where a stale leader might serve reads.

Byzantine fault tolerance

Raft and Paxos assume crash-stop failures: nodes either work correctly or stop. They do not handle Byzantine failures: nodes that behave maliciously or send incorrect messages. Byzantine fault-tolerant consensus (PBFT, Tendermint) requires 3f+1 nodes to tolerate f Byzantine failures. Used in blockchain systems.

Real-world systems

etcd - Uses Raft. The consensus store for Kubernetes. Stores cluster state, configuration, and service discovery data. Requires a majority of nodes to be available.

ZooKeeper - Uses ZAB (ZooKeeper Atomic Broadcast), similar to Paxos. Used by Kafka (older versions), HBase, and other distributed systems for coordination.

CockroachDB - Uses Raft per range (each range of data has its own Raft group). Globally consistent distributed SQL database.

Consul - Uses Raft for service discovery and configuration. Provides distributed locking and leader election.

TiKV - Uses Raft for distributed key-value storage. The storage layer for TiDB.

MongoDB - Uses Raft-like protocol for replica set elections. The primary is elected by a majority of replica set members.

How to apply it in practice

When you need consensus

You need consensus when:

Multiple nodes must agree on a single value (leader election, distributed lock)
You need linearizable reads and writes across multiple nodes
You need to coordinate distributed transactions

You do not need consensus when:

Eventual consistency is acceptable
You can use a single node (no distribution needed)
You can use a simpler coordination mechanism (optimistic locking, CAS operations)

Cluster sizing

Always use an odd number of nodes: 3, 5, or 7. An even number does not improve fault tolerance (a 4-node cluster can tolerate 1 failure, same as a 3-node cluster) but adds overhead.

3 nodes: Tolerates 1 failure. Minimum for production.
5 nodes: Tolerates 2 failures. Standard for critical systems.
7 nodes: Tolerates 3 failures. For very high availability requirements.

Operational considerations

Monitor leader elections (frequent elections indicate instability)
Monitor replication lag (followers falling behind)
Use separate disks for the consensus log (I/O contention can cause election timeouts)
Set election timeouts appropriately (too short = frequent elections, too long = slow failover)

FAQ

Q: What is the difference between Raft and Paxos?

Both solve the same problem (consensus) with similar guarantees. Raft was designed to be more understandable: it decomposes the problem into leader election, log replication, and safety, with clear rules for each. Paxos is more general and flexible but harder to implement correctly. In practice, most new systems use Raft. Older systems (ZooKeeper, Chubby) use Paxos variants. The performance characteristics are similar.

Q: Can a Raft cluster make progress with exactly half the nodes available?

No. A majority requires more than half. In a 4-node cluster, a majority is 3. If only 2 nodes are available, no majority can be formed and the cluster stops accepting writes. This is why odd-numbered clusters are preferred: a 3-node cluster requires 2 nodes (majority), and a 4-node cluster also requires 3 nodes (majority). The 4-node cluster provides no additional fault tolerance over the 3-node cluster.

Q: How does Raft handle a leader that is partitioned from the majority?

The partitioned leader continues to think it is the leader. It sends heartbeats, but they do not reach the majority. The majority partition detects the missing heartbeats (election timeout), elects a new leader, and continues accepting writes. The old leader cannot commit any new entries because it cannot reach a majority. When the partition heals, the old leader discovers the new term (higher than its own) and steps down to follower. Its uncommitted entries are overwritten by the new leader’s log.

Interview questions

Q1: You are designing a distributed lock service. How would you use Raft to implement it?

Strong answer: Use a Raft cluster (3 or 5 nodes) as the consensus store. To acquire a lock: send a command to the Raft leader to set lock:resource_name = {holder: client_id, expires_at: timestamp}. The command is committed only when a majority of nodes have it in their logs. To release: send a command to delete the lock entry. To prevent stale locks: include an expiry time. A background process on the leader periodically checks for expired locks and releases them. For fencing: include a monotonically increasing token in the lock grant. The lock holder includes this token in all operations. The protected resource rejects operations with stale tokens. This is how etcd’s distributed locking works.

Q2: A 5-node Raft cluster has 2 nodes fail simultaneously. What happens?

Strong answer: The cluster can still make progress. A majority of a 5-node cluster is 3. With 2 failures, 3 nodes remain. The 3 remaining nodes can elect a leader and commit entries. If the leader was one of the failed nodes, the remaining 3 nodes will detect the missing heartbeats, start an election, and elect a new leader from among themselves. The cluster continues operating normally. If a third node fails, the cluster stops accepting writes (only 2 nodes remain, cannot form a majority). Reads from followers might still work depending on the implementation, but writes are blocked until a node recovers.

Q3: How does Raft ensure that a newly elected leader has all committed entries?

Strong answer: Raft’s election safety property: a candidate can only win an election if its log is at least as up-to-date as any other node’s log. “Up-to-date” means: higher term in the last log entry, or same term but longer log. When a candidate sends RequestVote, each voter compares the candidate’s log to its own. If the candidate’s log is less up-to-date, the voter rejects the vote. Since committed entries have been replicated to a majority, and any majority overlaps with the voting majority, at least one voter in any winning election has all committed entries. The candidate that wins must have a log at least as up-to-date as that voter, so it has all committed entries. This is the key safety guarantee of Raft.