Leader Election: Coordinating Who Is in Charge
Your job scheduler runs on three servers for redundancy. All three start up and begin scheduling jobs. The same job runs three times. Your database gets hammered. Your users get triplicate emails.
You need exactly one server to be the scheduler at any time. When that server fails, another should take over. This is leader election.
What leader election solves
Leader election is the process by which a group of nodes selects one node to be the “leader” - the single node responsible for a specific task. The leader handles the task. Followers stand by, ready to take over if the leader fails.
Common uses:
- Primary database selection (one node accepts writes)
- Distributed job scheduling (one node runs cron jobs)
- Distributed lock management (one node grants locks)
- Shard management (one node assigns shards to workers)
- Kafka controller (one broker manages partition assignments)
The key requirement: at any time, at most one node believes it is the leader. Two nodes simultaneously believing they are the leader (split-brain) causes data corruption or duplicate work.
Leader election approaches
Consensus-based election (Raft/Paxos)
The most reliable approach. Use a consensus algorithm (Raft) to elect a leader. The leader is the node that wins a majority vote. Only one node can win a majority at a time.
How it works in Raft:
- All nodes start as followers
- A follower starts an election when it does not hear from a leader within the election timeout
- It becomes a candidate and requests votes from other nodes
- If it receives votes from a majority, it becomes the leader
- The leader sends periodic heartbeats to prevent new elections
Used by: etcd, ZooKeeper, CockroachDB, Consul.
External coordination service
Use an existing consensus service (etcd, ZooKeeper) to manage leader election. Your application nodes compete to acquire a lock or create an ephemeral node. The node that succeeds is the leader.
etcd-based election:
- All nodes try to create a key with a lease:
PUT /leader {node_id: "node-1"} with lease - Only one succeeds (etcd is linearizable)
- The winner is the leader
- The leader renews its lease periodically
- If the leader fails to renew, the lease expires and the key is deleted
- Other nodes detect the deletion and compete for the key again
ZooKeeper-based election:
- All nodes create ephemeral sequential znodes:
/election/node-0000000001,/election/node-0000000002 - The node with the lowest sequence number is the leader
- Each non-leader watches the node with the next lower sequence number
- If that node disappears (leader failed), the watcher becomes the new leader
graph TB subgraph etcd_election["etcd-Based Leader Election"] N1["Node 1"] -->|"PUT /leader node1 (lease)"| ETCD["etcd cluster"] N2["Node 2"] -->|"PUT /leader node2 (lease)"| ETCD N3["Node 3"] -->|"PUT /leader node3 (lease)"| ETCD ETCD -->|"node1 wins (linearizable)"| L["Node 1 = Leader"] ETCD -->|"failed"| F2["Node 2 = Follower watches /leader"] ETCD -->|"failed"| F3["Node 3 = Follower watches /leader"] L -->|"renew lease every 5s"| ETCD L -->|"crashes"| DEAD["Node 1 fails lease expires"] DEAD -->|"key deleted"| ETCD ETCD -->|"notify watchers"| NEW["Node 2 or 3 becomes leader"] end style L fill:#EEEDFE,stroke:#534AB7,color:#3C3489 style ETCD fill:#E1F5EE,stroke:#0F6E56,color:#085041 style DEAD fill:#FCEBEB,stroke:#A32D2D,color:#791F1F style NEW fill:#EEEDFE,stroke:#534AB7,color:#3C3489
Bully algorithm
A simple election algorithm for systems without a consensus service. The node with the highest ID wins.
How it works:
- A node detects the leader has failed (no heartbeat)
- It sends an
Electionmessage to all nodes with higher IDs - If no higher-ID node responds, it declares itself the leader
- If a higher-ID node responds, it takes over the election
- The winner sends a
Coordinatormessage to all nodes
Problem: The bully algorithm does not prevent split-brain. If the network is partitioned, multiple nodes might declare themselves leader simultaneously.
Used for: Simple systems where split-brain is acceptable or can be handled at the application level.
The fencing token: preventing split-brain
Even with a good election algorithm, a “zombie leader” can cause problems. A leader that is paused (GC pause, slow disk) might miss heartbeats and be replaced. When it resumes, it still thinks it is the leader. Now you have two leaders.
The fencing token solves this. When a node becomes leader, it receives a monotonically increasing token. Every operation the leader performs must include this token. The protected resource (database, lock service) rejects operations with stale tokens.
Leader 1 elected: token = 5
Leader 1 paused (GC)
Leader 2 elected: token = 6
Leader 2 writes with token 6: accepted
Leader 1 resumes, writes with token 5: REJECTED (stale token)
This prevents the zombie leader from causing damage even if it thinks it is still the leader.
graph LR subgraph fencing["Fencing Token Pattern"] L1["Leader 1 token=5"] -->|"paused"| PAUSE["GC pause 30 seconds"] ETCD2["etcd lease expires"] -->|"elect new leader"| L2["Leader 2 token=6"] L2 -->|"write with token=6"| DB["Database accepts token 6"] PAUSE -->|"resumes still thinks leader"| L1 L1 -->|"write with token=5"| DB DB -->|"REJECTED token 5 < 6"| L1 end style L1 fill:#FCEBEB,stroke:#A32D2D,color:#791F1F style L2 fill:#EEEDFE,stroke:#534AB7,color:#3C3489 style DB fill:#E1F5EE,stroke:#0F6E56,color:#085041
Where it breaks or gets interesting
The election timeout tradeoff
Short election timeout: fast failover when the leader fails. But also more false elections (network hiccup causes unnecessary election).
Long election timeout: fewer false elections. But slow failover when the leader actually fails.
Typical values: 150-300ms for Raft in a LAN. Longer for WAN deployments.
Leader election in multi-region deployments
Cross-region latency (50-200ms) makes consensus-based election slow. A leader in US-East and a follower in EU-West: the follower’s election timeout must be longer than the cross-region RTT to avoid false elections.
Options: use a single region for the consensus cluster (accept that the other region cannot elect a leader during a partition), use a multi-region consensus algorithm (Raft with geo-distributed nodes), or use a different coordination mechanism for cross-region scenarios.
The thundering herd on leader failure
When the leader fails, all followers detect it simultaneously and start elections simultaneously. Multiple candidates split the vote. No one wins. New elections start. This can repeat for several rounds.
Raft mitigates this with randomized election timeouts: each follower waits a random time before starting an election. The first one to time out usually wins before others start.
Lease-based leadership
Instead of continuous heartbeats, use a lease: the leader holds a lease for a fixed duration (e.g., 10 seconds). During the lease, it is guaranteed to be the leader. Before the lease expires, it renews. If it cannot renew (crashed), the lease expires and a new leader is elected.
Lease-based leadership allows the leader to serve reads without contacting the consensus cluster on every read (within the lease duration). This reduces read latency.
Real-world systems
Kubernetes - Uses etcd for leader election of the controller manager and scheduler. Only one instance of each runs at a time. Uses the coordination.k8s.io/v1 Lease resource.
Kafka - Uses ZooKeeper (older versions) or KRaft (newer versions) for controller election. The controller manages partition assignments and broker failures.
PostgreSQL with Patroni - Patroni uses etcd, ZooKeeper, or Consul for leader election. The primary is the node that holds the leader key. Patroni handles automatic failover.
Redis Sentinel - Sentinel nodes vote to elect a new primary when the current primary fails. Requires a majority of Sentinel nodes to agree.
Elasticsearch - Uses a custom consensus algorithm for master node election. The master manages cluster state (index creation, shard assignment).
How to apply it in practice
Use an existing coordination service
Do not implement leader election from scratch. Use etcd, ZooKeeper, or Consul. They have been battle-tested and handle edge cases you will not think of.
For Kubernetes workloads: use the Kubernetes leader election library (k8s.io/client-go/tools/leaderelection). It uses Kubernetes Lease objects backed by etcd.
For non-Kubernetes workloads: use etcd’s election API or Consul’s session-based locking.
Always use fencing tokens
Even with a good election algorithm, implement fencing tokens for any resource the leader writes to. This is the defense against zombie leaders.
Monitor leader elections
Alert on:
- Frequent leader elections (indicates instability)
- Long election duration (indicates network issues)
- No leader elected (cluster is unavailable)
FAQ
Q: What is the difference between leader election and distributed locking?
They are closely related. A distributed lock is held by one node at a time. Leader election is the process of deciding which node holds the “leader lock.” In practice, leader election is often implemented using a distributed lock: the node that acquires the lock is the leader. The difference is semantic: a lock is for mutual exclusion of a resource, leader election is for selecting a coordinator.
Q: Can you do leader election without a consensus service?
Yes, but it is harder to get right. The bully algorithm works without a consensus service but does not prevent split-brain. For simple use cases (a single cron job that should run on one server), you can use a database row as a lock: UPDATE leader SET node_id = 'node-1', expires_at = NOW() + INTERVAL '30 seconds' WHERE expires_at < NOW(). The node that successfully updates the row is the leader. This works if the database is a single node (no distributed consensus needed). For distributed databases, you need the database to support atomic compare-and-swap operations.
Q: How long should a leader lease be?
The lease duration is a tradeoff between failover speed and false positives. A 10-second lease means: if the leader crashes, it takes up to 10 seconds for a new leader to be elected. A 30-second lease means slower failover but fewer false elections. For most applications, 10-30 seconds is appropriate. For latency-sensitive applications (real-time trading), shorter leases (1-5 seconds) with careful timeout tuning.
Interview questions
Q1: You are building a distributed job scheduler. Jobs should run exactly once, even if multiple scheduler instances are running. How do you implement this?
Strong answer: Use leader election to ensure only one scheduler instance is active at a time. Use etcd or ZooKeeper for the election. The elected leader runs the scheduler. Followers monitor the leader and are ready to take over. When the leader fails, a follower is elected and resumes scheduling. For the jobs themselves: use idempotency keys (job ID + scheduled time) to prevent duplicate execution if the leader fails mid-job and the new leader retries. Store job state in a database: pending, running, completed. Before running a job, atomically update its state from pending to running (using a conditional update). If two schedulers try to run the same job, only one succeeds. This handles the edge case where the leader fails after starting a job but before marking it complete.
Q2: Your leader election uses a 30-second lease. The leader experiences a 45-second GC pause. What happens?
Strong answer: The leader’s lease expires after 30 seconds. A new leader is elected. The new leader starts operating with a new fencing token (higher than the old leader’s token). After 45 seconds, the old leader resumes. It still thinks it is the leader. It tries to perform operations with its old fencing token. The protected resources reject these operations because the token is stale. The old leader eventually detects that its lease has expired (it tries to renew and fails, or it sees a higher term in the consensus cluster) and steps down to follower. The system is safe because of fencing tokens. Without fencing tokens, the old leader would have caused data corruption during the 15-second window when both leaders were active.
Q3: How does Kubernetes use leader election for its control plane components?
Strong answer: Kubernetes controller manager and scheduler run as multiple replicas for high availability, but only one replica should be active at a time. They use the Kubernetes leader election library, which creates a Lease object in the kube-system namespace. Each replica tries to acquire the lease by updating the Lease object with its identity and a renewal timestamp. The replica that successfully updates the lease is the leader. It renews the lease every renewDeadline seconds (default 10s). If it fails to renew within leaseDuration (default 15s), the lease expires and another replica acquires it. The retryPeriod (default 2s) controls how often non-leaders try to acquire the lease. This is backed by etcd, which provides the linearizable compare-and-swap needed for safe leader election.