Consistent Hashing: Distributing Data Without Reshuffling Everything


You have a distributed cache with 4 nodes. You store keys using hash(key) % 4 to decide which node gets each key. It works great. Then you add a 5th node because traffic is growing. Now the formula is hash(key) % 5. Almost every key maps to a different node. Your cache hit rate drops to near zero. Every request goes to the database. Your database falls over.

This is the problem consistent hashing solves. It is one of those ideas that seems complex until you understand it, and then seems obvious.

What consistent hashing actually is

Consistent hashing is a technique for distributing keys across nodes such that when nodes are added or removed, only a small fraction of keys need to be remapped.

With naive modulo hashing (hash(key) % N), adding or removing one node remaps nearly all keys. With consistent hashing, adding or removing one node remaps only 1/N of keys on average.

The core idea: arrange both keys and nodes on a circular ring (a hash ring). Each key is assigned to the first node clockwise from its position on the ring.

graph TB
subgraph ring["Hash Ring - 3 Nodes"]
  N1["Node A
position 0"]
  N2["Node B
position 120"]
  N3["Node C
position 240"]
  K1["key1
position 30
maps to Node B"]
  K2["key2
position 150
maps to Node C"]
  K3["key3
position 270
maps to Node A"]
end

subgraph add["After Adding Node D at position 180"]
  M1["key1 at 30 - still Node B"]
  M2["key2 at 150 - now Node D"]
  M3["key3 at 270 - still Node A"]
  NOTE["Only keys between 120-180
need to move to Node D"]
end

style N1 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style N2 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style N3 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style K1 fill:#F1EFE8,stroke:#888780,color:#444441
style K2 fill:#F1EFE8,stroke:#888780,color:#444441
style K3 fill:#F1EFE8,stroke:#888780,color:#444441
style NOTE fill:#E1F5EE,stroke:#0F6E56,color:#085041

How it works - step by step

Step 1: Create the ring. Hash the range 0 to 2^32 (or 2^64) into a circle. Position 0 and position 2^32 are the same point.

Step 2: Place nodes on the ring. Hash each node’s identifier (IP address, hostname) to get its position on the ring. Node A might hash to position 100, Node B to position 200, Node C to position 300.

Step 3: Place keys on the ring. Hash each key to get its position. Key “user:123” might hash to position 150.

Step 4: Find the responsible node. Walk clockwise from the key’s position until you hit a node. That node owns the key. Key at position 150 walks clockwise and hits Node B at position 200.

Step 5: Handle node addition. Add Node D at position 175. Keys between 150 and 175 that previously walked to Node B now walk to Node D first. Only those keys need to move. Everything else is unchanged.

Step 6: Handle node removal. Remove Node B at position 200. Keys that were assigned to Node B now walk clockwise to Node C at position 300. Only Node B’s keys need to move.

The virtual nodes problem

Basic consistent hashing has a flaw: with few nodes, the ring is unevenly divided. Node A might own 60% of the ring, Node B 30%, Node C 10%. The load is not balanced.

The fix: virtual nodes (vnodes). Instead of placing each physical node once on the ring, place it many times (100-200 positions). Each physical node owns many small segments of the ring rather than one large segment. The distribution becomes much more even.

Virtual nodes also make adding capacity smoother. When you add a new physical node, it takes over some vnodes from each existing node, spreading the migration load evenly.

graph LR
subgraph basic["Basic - Uneven Distribution"]
  BA["Node A
60% of ring"]
  BB["Node B
30% of ring"]
  BC["Node C
10% of ring"]
end

subgraph vnode["Virtual Nodes - Even Distribution"]
  VA["Node A
A1, A2, A3, A4
~33% of ring"]
  VB["Node B
B1, B2, B3, B4
~33% of ring"]
  VC["Node C
C1, C2, C3, C4
~33% of ring"]
end

style BA fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
style BB fill:#FAEEDA,stroke:#854F0B,color:#633806
style BC fill:#FAEEDA,stroke:#854F0B,color:#633806
style VA fill:#E1F5EE,stroke:#0F6E56,color:#085041
style VB fill:#E1F5EE,stroke:#0F6E56,color:#085041
style VC fill:#E1F5EE,stroke:#0F6E56,color:#085041

Where it breaks or gets interesting

Hotspots with skewed key distributions

Consistent hashing distributes keys evenly across the ring, but if some keys are accessed far more than others (a celebrity’s profile, a viral post), the node owning that key becomes a hotspot regardless of how evenly keys are distributed. The solution is application-level: cache hot keys separately, replicate them to multiple nodes, or add a random suffix to spread reads across replicas.

Replication across the ring

For fault tolerance, each key is typically stored on multiple nodes. The standard approach: store the key on the first N nodes clockwise from the key’s position (N is the replication factor). Cassandra uses this exact model - a replication factor of 3 means each key is on 3 consecutive nodes on the ring.

Consistent hashing is not the same as rendezvous hashing

Rendezvous hashing (highest random weight) is an alternative that achieves similar properties. For each key, compute a score for every node and pick the highest. When a node is removed, only keys that scored that node highest need to move. Simpler to implement but requires checking all nodes for each key lookup, which is O(N) vs O(log N) for a ring with binary search.

The ring lookup is O(log N)

Finding the responsible node requires finding the first node clockwise from a position. With a sorted array of node positions and binary search, this is O(log N). With 1000 virtual nodes, that is about 10 comparisons - negligible.

Real-world systems that use it

Cassandra - Uses consistent hashing with virtual nodes (vnodes) as its core partitioning strategy. Each node owns a set of token ranges on the ring. Adding a node automatically takes token ranges from existing nodes.

DynamoDB - Amazon’s original Dynamo paper (2007) described consistent hashing with virtual nodes. DynamoDB’s internal partitioning is based on this design.

Memcached client libraries - Client-side consistent hashing distributes cache keys across a pool of Memcached servers. When a server is added or removed, only a fraction of keys need to be re-fetched from the database.

Redis Cluster - Uses a variant called hash slots (16384 slots). Each node owns a range of slots. Slots can be migrated between nodes for rebalancing. Conceptually similar to consistent hashing with 16384 virtual nodes.

Nginx upstream hashing - hash $request_uri consistent uses consistent hashing to route requests to upstream servers, providing cache locality (the same URL always goes to the same upstream).

Chord DHT - The academic distributed hash table that formalized consistent hashing. Each node knows about its successor and a set of finger nodes for O(log N) key lookup.

How to apply it in practice

When to use consistent hashing

Use it when you have a pool of nodes that changes over time and you need to minimize disruption when the pool changes:

  • Distributed caches (Memcached, Redis Cluster)
  • Distributed databases with manual sharding
  • Load balancing with cache locality requirements
  • Distributed task queues where tasks should go to the same worker

When not to use it

If your node pool is static and never changes, simple modulo hashing is simpler and works fine. If you need strong consistency guarantees about which node owns which key (for distributed locking), consistent hashing alone is not enough - you need a consensus protocol on top.

Replication factor selection

For a cache: replication factor 1 is fine. Cache misses are expensive but not catastrophic.

For a database: replication factor 3 is the standard. Tolerates one node failure with no data loss. With quorum reads and writes (2 of 3), you get strong consistency.

For critical data: replication factor 5 across 3 availability zones. Tolerates an entire AZ failure.

FAQ

Q: How does consistent hashing handle the case where a node is temporarily unavailable vs permanently removed?

These are different operations. A temporary unavailability (network blip, restart) should not trigger key migration - you just route around the node until it recovers. A permanent removal triggers migration of that node’s keys to its successor. Most systems use a separate membership protocol (gossip, ZooKeeper) to distinguish between the two. Cassandra’s hinted handoff handles temporary unavailability: writes destined for a down node are stored as hints on another node and delivered when the node recovers.

Q: What is the difference between consistent hashing and range-based partitioning?

Consistent hashing assigns keys to nodes based on a hash of the key. The distribution is pseudo-random and even, but you cannot do range queries efficiently (keys with similar values are on different nodes). Range-based partitioning assigns contiguous key ranges to nodes (node 1 owns keys A-M, node 2 owns N-Z). Range queries are efficient, but you can get hotspots if writes are concentrated in one range (e.g., time-series data where all new writes go to the latest range). HBase and Bigtable use range-based partitioning. Cassandra uses consistent hashing.

Q: Can consistent hashing guarantee perfectly even load distribution?

No. It minimizes the disruption when nodes change, but the distribution depends on the hash function and the number of virtual nodes. With enough virtual nodes (150-200 per physical node), the distribution is close to even in practice. But skewed access patterns (hot keys) can still cause uneven load regardless of how keys are distributed.

Interview questions

Q1: You have a Memcached cluster with 5 nodes. You need to add 3 more nodes to handle increased traffic. How does consistent hashing minimize cache disruption?

Strong answer: With naive modulo hashing, changing from 5 to 8 nodes remaps almost all keys (only keys where hash(key) % 5 == hash(key) % 8 stay on the same node - about 5/8 = 62.5% move). With consistent hashing, adding 3 nodes to a ring of 5 means each new node takes over roughly 1/8 of the ring. About 3/8 = 37.5% of keys move total, and they move to the new nodes specifically. The other 62.5% of keys are completely unaffected. In practice with virtual nodes, the migration is even smoother - each new node takes small chunks from each existing node rather than one large chunk from one node. The cache hit rate drops temporarily as the moved keys are re-fetched from the database, but it recovers quickly.

Q2: Design the partitioning scheme for a distributed key-value store that needs to handle node additions and removals with minimal data movement.

Strong answer: Use consistent hashing with virtual nodes. Each physical node gets 150-200 virtual node positions on the hash ring. For replication, store each key on the next 3 nodes clockwise (replication factor 3). For reads and writes, use quorum (2 of 3) for strong consistency. When adding a node: assign it virtual node positions, it takes over the key ranges for those positions from the current owners, data migrates in the background. When removing a node: its virtual node positions are taken over by the next clockwise node, data migrates before the node is fully removed. Use a gossip protocol for membership so all nodes know the current ring state. This is essentially how Cassandra works.

Q3: A key-value store uses consistent hashing. One node is receiving 10x more traffic than the others. What are the possible causes and fixes?

Strong answer: Two possible causes. First, uneven ring distribution: the node owns a disproportionately large segment of the ring. Fix: add more virtual nodes for the other nodes to rebalance, or use a better hash function. Second, hot keys: a small number of very popular keys all happen to hash to this node’s range. The ring is balanced but the access pattern is not. Fix: identify the hot keys (they will show up in access logs or metrics). For read-heavy hot keys, add a local in-process cache in front of the distributed cache. For write-heavy hot keys, shard the key itself (append a random suffix 0-9 to spread writes across 10 keys, then aggregate on read). For extremely hot keys, consider replicating them to multiple nodes and load balancing reads across all replicas.