Build a Distributed Unique ID Generator



System Design Deep Dive

Distributed Unique ID Generator

Millions of IDs per second, zero coordination, zero collisions - even when clocks lie


Imagine every parcel leaving a warehouse needs a unique tracking number printed on the label before it goes onto the conveyor belt. The packing stations are scattered across twenty buildings, running at thousands of parcels per minute, with no time to phone a central office to ask “what is the next number?” The label must be printed now, it must never repeat, and it must sort chronologically so the tracking website can show you where your package is in the journey. This is exactly the problem a distributed unique ID generator solves - at software scale, across thousands of application nodes, with sub-millisecond latency requirements.

The naive solution is a database auto-increment column. It works perfectly up to a few thousand writes per second on a single primary. At 100,000 writes per second, a single Postgres instance doing SELECT nextval('id_seq') becomes the bottleneck for your entire platform. Every service that needs an ID must serialize through one database - and a network round-trip of even 1ms means you are capped at 1,000 IDs per second per connection. Add a few hundred services, each needing thousands of IDs per second, and the math collapses fast.
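The ceiling above follows from simple arithmetic. Here is a rough sketch, assuming each connection issues one blocking nextval call per round trip with no pipelining or sequence caching; the function names are illustrative only.

```python
import math

def max_ids_per_sec_per_connection(round_trip_ms: float) -> float:
    """One blocking request per round trip: throughput is the inverse of latency."""
    return 1000.0 / round_trip_ms

def connections_needed(target_ids_per_sec: int, round_trip_ms: float) -> int:
    """Connections required to reach the target rate, ignoring database-side limits."""
    return math.ceil(target_ids_per_sec / max_ids_per_sec_per_connection(round_trip_ms))

per_conn = max_ids_per_sec_per_connection(1.0)   # 1 ms round trip
needed = connections_needed(100_000, 1.0)        # 100,000 IDs/sec target
```

Fetching a block of sequence values per call raises the per-connection ceiling, but every ID still originates from a single primary.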

UUID v4 removes the coordination problem by letting each node generate random 128-bit identifiers independently. There is no central server, and the collision probability is negligible in practice. But UUIDs are random, which means they are not time-sortable. When you insert UUID primary keys into a B-tree index (which almost every database uses), random inserts cause page splits throughout the tree. At high write rates this turns what would be sequential appends into scattered random I/O, destroying your write throughput. A UUID is also twice the size of a 64-bit integer (16 bytes versus 8), and it is opaque - you cannot extract a timestamp from a UUID v4 to understand when a record was created without storing an extra column.
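The ordering and size costs are easy to observe with the standard library. This is a small sketch, not a benchmark: it only shows that value order is unrelated to creation order and that each UUID takes 16 bytes.

```python
import uuid

# Generate UUID v4 values in a known creation order.
created = [uuid.uuid4() for _ in range(1000)]

# Sorting by value scrambles the creation order, because the bits are random.
sorted_by_value = sorted(created)
order_preserved = (sorted_by_value == created)

# A UUID occupies 16 bytes; a 64-bit integer ID occupies 8.
uuid_size_bytes = len(created[0].bytes)
```

An index keyed on these values receives inserts at effectively random positions, which is the source of the page splits described above.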

The architecture we need must solve three tensions simultaneously. It must generate IDs with no coordination overhead so each node operates fully independently. It must produce IDs that are roughly time-sortable so databases can store them efficiently and engineers can reason about ordering. And it must guarantee global uniqueness across thousands of nodes in multiple datacenters, even when clocks skew, nodes crash mid-generation, and networks partition.

Requirements and Constraints

Functional Requirements

  • Generate globally unique 64-bit integer IDs on demand
  • IDs must be roughly time-sortable: on a given node, an ID generated later is numerically larger than one generated earlier; across nodes, ordering holds only to within clock skew, and order within the same millisecond is arbitrary but stable
  • Each node generates IDs independently without contacting any other node at generation time
  • IDs must embed enough metadata to determine approximate creation time, source datacenter, and source node
  • Support batch generation (request N IDs in one call)

Non-Functional Requirements

  • Throughput: 4,096 IDs per millisecond per node (4.096 million IDs/sec per node), 4.19 billion IDs/sec across 1,024 nodes
  • Latency: ID generation must complete in under 1 microsecond (in-process) or under 2ms (RPC service mode)
  • Uniqueness: zero collisions guaranteed by construction, not probabilistically
  • Availability: 99.99% - ID generation must survive individual node failures with automatic failover
  • Ordering: IDs generated in the same millisecond on the same node are strictly ordered; IDs across nodes in the same millisecond are ordered within that node but unordered globally
  • Time range: support 69.7 years from a fixed epoch before the timestamp field overflows

Constraints

  • We assume NTP-synchronized clocks with maximum drift under 200ms between any two nodes
  • Machine IDs are pre-assigned at node startup via a coordination service (ZooKeeper or etcd) - this coordination happens once, not per ID
  • We do not guarantee strict global ordering across nodes (that would require coordination per ID, violating our throughput requirement)
  • The system does not support retrieval, lookup, or any read path - it is write-only generation

High-Level Architecture

Distributed unique ID generator architecture overview showing client tier, ID generator cluster, machine ID registry, clock layer, and downstream consumers

The system has four major layers. The ID Generator Cluster is the hot path - stateless worker processes, each with a pre-assigned machine ID, that generate IDs entirely in-process with no I/O. The Machine ID Registry is an etcd or ZooKeeper cluster responsible for assigning and tracking the 10-bit machine ID for each worker, using ephemeral leases so dead workers automatically release their IDs. The Clock Layer is not a separate service but a concern within each worker - logic that reads the system clock, detects backward drift, and enforces monotonicity. The Downstream Consumers are any services that use these IDs as primary keys, message keys, or sort keys.

On the request path, a caller invokes generateId() on its local ID generator library (embedded in-process) or via a lightweight RPC to a dedicated ID service. The library reads the current system clock in milliseconds, computes the offset from a fixed epoch, combines it with the pre-assigned machine ID and an incrementing sequence counter, and returns a 64-bit integer. No database, no network call, no lock contention with any other node. The only shared state is the in-process last_ts and sequence variables, protected by a single mutex or atomic CAS operation within one process.

At startup, the worker contacts the registry to claim a machine ID. It acquires an ephemeral lease on a numeric slot - if the process dies, the lease expires within 30 seconds and the slot becomes available to another worker. Apart from the background keepalive that renews this lease, startup is the only moment where the generator touches external state. Every ID generated after that point uses only in-process state and the system clock.

Key Insight

The Snowflake architecture achieves zero per-ID coordination by front-loading all coordination into startup: machine ID assignment happens once at boot via an ephemeral lease, and every subsequent ID generation is a pure in-process computation using only the clock and a counter.

The Snowflake ID Structure

The Snowflake algorithm, originally designed at Twitter in 2010 and now the blueprint for ID generation at Uber (UberID), Discord, Instagram, and Shopify, packs four fields (plus a sign bit) into a 64-bit integer. Understanding the bit layout is key to understanding every tradeoff in the system.

Snowflake 64-bit ID bit field structure showing sign bit, 41-bit timestamp, 5-bit datacenter ID, 5-bit node ID, and 12-bit sequence number

The leftmost bit (bit 63) is the sign bit, always set to 0, ensuring IDs are positive in every language that has signed 64-bit integer types (which is most of them). A Java long or a PostgreSQL BIGINT can hold a negative value, but a negative ID would sort before every positive ID and would break any schema or code that assumes non-negative keys, so we sacrifice one bit for safety.

The next 41 bits are the epoch timestamp in milliseconds. This is not Unix epoch (January 1, 1970) but a custom epoch chosen at system design time - typically a date close to the system’s launch. Twitter’s original epoch was 2010-11-04T01:42:54.657Z. Using a custom epoch instead of Unix epoch has a concrete benefit: it keeps the timestamp value smaller, which means the IDs themselves are smaller numbers and easier to handle in systems that have integer size limits. 2^41 = 2,199,023,255,552 milliseconds = 69.7 years from whatever epoch you choose.
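The 69.7-year figure, and the date on which a given epoch runs out, can be checked directly. This sketch assumes a year of 365.25 days; overflow_date is a hypothetical helper, not part of any library.

```python
from datetime import datetime, timedelta, timezone

TIMESTAMP_BITS = 41
MS_PER_YEAR = 1000 * 60 * 60 * 24 * 365.25

range_ms = 1 << TIMESTAMP_BITS
range_years = range_ms / MS_PER_YEAR

def overflow_date(epoch_ms: int) -> datetime:
    """First instant that no longer fits in the 41-bit timestamp field."""
    unix_zero = datetime.fromtimestamp(0, tz=timezone.utc)
    return unix_zero + timedelta(milliseconds=epoch_ms + range_ms)

twitter_overflow = overflow_date(1288834974657)   # Twitter's epoch from the text
custom_overflow = overflow_date(1704067200000)    # 2024-01-01 00:00:00 UTC
```

Choosing a later epoch pushes the overflow date out by the same amount, which is the practical reason to pick one close to launch.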

The next 10 bits are the machine identifier, split in Twitter’s original design into a 5-bit datacenter ID and a 5-bit worker ID within that datacenter. This gives 32 datacenters x 32 nodes = 1,024 unique machine IDs globally. Some implementations use the full 10 bits as a flat machine ID (0 to 1023) assigned by the registry, which is simpler but loses the geographic hint. Instagram’s variant uses 13 bits for the shard ID (up to 8,192 shards) and keeps the timestamp at 41 bits, leaving 10 bits for the per-shard sequence.

The final 12 bits are the sequence number, a per-millisecond counter that starts at 0 for each new millisecond and increments with every ID generated within that millisecond. 2^12 = 4,096, so each node can generate up to 4,096 unique IDs per millisecond (4.096 million per second). If the sequence overflows within a millisecond (i.e., you need a 4,097th ID in the same millisecond), the generator must wait until the clock advances to the next millisecond before resetting the sequence and continuing.

# Snowflake bit layout constants
EPOCH_MS        = 1704067200000   # 2024-01-01 00:00:00 UTC in milliseconds
TIMESTAMP_BITS  = 41
DATACENTER_BITS = 5
WORKER_BITS     = 5
SEQUENCE_BITS   = 12

WORKER_SHIFT     = SEQUENCE_BITS                      # 12
DATACENTER_SHIFT = WORKER_BITS + SEQUENCE_BITS        # 17
TIMESTAMP_SHIFT  = DATACENTER_BITS + DATACENTER_SHIFT # 22

MAX_DATACENTER_ID = (1 << DATACENTER_BITS) - 1  # 31
MAX_WORKER_ID     = (1 << WORKER_BITS) - 1      # 31
MAX_SEQUENCE      = (1 << SEQUENCE_BITS) - 1    # 4095

def assemble_id(timestamp_ms: int, datacenter_id: int,
                worker_id: int, sequence: int) -> int:
    # All fields are validated before this call
    ts_offset = timestamp_ms - EPOCH_MS
    return (
        (ts_offset      << TIMESTAMP_SHIFT) |
        (datacenter_id  << DATACENTER_SHIFT) |
        # The worker field sits directly above the 12 sequence bits
        (worker_id      << WORKER_SHIFT) |
        sequence
    )

def extract_fields(snowflake_id: int) -> dict:
    return {
        "timestamp_ms":   (snowflake_id >> TIMESTAMP_SHIFT) + EPOCH_MS,
        "datacenter_id":  (snowflake_id >> DATACENTER_SHIFT) & MAX_DATACENTER_ID,
        "worker_id":      (snowflake_id >> WORKER_SHIFT) & MAX_WORKER_ID,
        "sequence":       snowflake_id & MAX_SEQUENCE,
    }
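
Because the timestamp occupies the most significant bits, numeric order follows timestamp order before anything else. The following sketch illustrates that property with a hypothetical pack helper that uses the 41/5/5/12 layout described above; it is an illustration, not the generator itself.

```python
SEQUENCE_BITS = 12
WORKER_BITS = 5
DATACENTER_BITS = 5
WORKER_SHIFT = SEQUENCE_BITS                           # 12
DATACENTER_SHIFT = WORKER_SHIFT + WORKER_BITS          # 17
TIMESTAMP_SHIFT = DATACENTER_SHIFT + DATACENTER_BITS   # 22

def pack(ts_offset_ms: int, datacenter_id: int, worker_id: int, sequence: int) -> int:
    """Combine the four fields into one integer (inputs assumed to be in range)."""
    return ((ts_offset_ms << TIMESTAMP_SHIFT)
            | (datacenter_id << DATACENTER_SHIFT)
            | (worker_id << WORKER_SHIFT)
            | sequence)

# An ID from a later millisecond is larger even when every other field
# of the earlier ID is at its maximum.
earlier = pack(1000, 31, 31, 4095)
later = pack(1001, 0, 0, 0)

# Within one millisecond, order across nodes is decided by the machine bits,
# not by real arrival time.
same_ms_node_a = pack(1000, 0, 1, 7)
same_ms_node_b = pack(1000, 0, 2, 0)
```

This is why the IDs are only roughly time-sortable: ordering is exact across milliseconds on one clock, but arbitrary between nodes inside the same millisecond.
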
Real World

Instagram’s Sharding & IDs post (2012) describes a Postgres-based variant where the 64-bit ID is generated inside a PL/pgSQL function per shard. Each shard has its own sequence, and the shard ID is encoded in the ID. This avoids a separate service entirely at the cost of tighter database coupling. Discord later moved to a pure in-process Snowflake library to eliminate the database dependency for ID generation.
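
A shard-scoped layout of this kind can be sketched in a few lines. The 41/13/10 split follows the bit widths described for Instagram above; the epoch constant and function names are assumptions made for illustration, and in Instagram's design the equivalent logic runs inside the database rather than in application code.

```python
SHARD_BITS = 13
SEQ_BITS = 10
CUSTOM_EPOCH_MS = 1704067200000  # assumed epoch (2024-01-01 UTC), not Instagram's

def shard_scoped_id(now_ms: int, shard_id: int, shard_sequence: int) -> int:
    """Compose timestamp, shard ID, and a per-shard counter into one 64-bit value."""
    if not 0 <= shard_id < (1 << SHARD_BITS):
        raise ValueError("shard_id out of range")
    seq = shard_sequence % (1 << SEQ_BITS)   # per-shard counter, modulo 1024
    return (((now_ms - CUSTOM_EPOCH_MS) << (SHARD_BITS + SEQ_BITS))
            | (shard_id << SEQ_BITS)
            | seq)

def shard_of(id_value: int) -> int:
    """Recover the shard ID from an ID without any lookup."""
    return (id_value >> SEQ_BITS) & ((1 << SHARD_BITS) - 1)

example = shard_scoped_id(CUSTOM_EPOCH_MS + 5_000, shard_id=1341, shard_sequence=5001)
```

The smaller sequence field caps each shard at 1,024 IDs per millisecond, traded for room to address 8,192 shards.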

Clock Drift Handling and Monotonicity Guarantee

Clock drift is the hardest failure mode in this system, and the one most engineers underestimate. The problem is simple: NTP can adjust your system clock backward. When a node’s clock jumps back 5 milliseconds to synchronize with the NTP server, the next call to currentTimeMillis() returns a value smaller than the previous call. If we naively use that timestamp, we will generate an ID with a smaller timestamp than the previous ID - violating the time-sortability guarantee. Worse, if the clock lands on a millisecond in which this node has already issued IDs and the sequence counter restarts from 0, the node can reissue an ID it has already handed out.

Snowflake ID generation request data flow showing clock check, drift handling, sequence increment, and bit assembly

The monotonicity guarantee is enforced by tracking last_ts - the timestamp used for the most recently generated ID. Every call to generateId() begins by reading the current clock. If current_ts < last_ts, the clock went backward. The response depends on the magnitude:

For small backward adjustments (under ~5ms), the generator simply waits in a busy loop until current_ts >= last_ts. This is safe because NTP adjustments are typically small, and 5ms of busy waiting is invisible in most workloads.

For large backward adjustments (over a configurable threshold like 5ms), the generator raises an exception and refuses to generate IDs. This forces the caller to handle the failure explicitly. The reasoning is that a large backward clock jump suggests something seriously wrong - a VM migration, a suspended container that resumed, or a misconfigured NTP client. Silently waiting would block all ID generation for an unacceptable duration.

# Thread-safe Snowflake ID generator with clock drift protection
import threading
import time

class SnowflakeGenerator:
    MAX_BACKWARD_WAIT_MS = 5

    def __init__(self, datacenter_id: int, worker_id: int):
        if datacenter_id > MAX_DATACENTER_ID or datacenter_id < 0:
            raise ValueError(f"datacenter_id must be 0-{MAX_DATACENTER_ID}")
        if worker_id > MAX_WORKER_ID or worker_id < 0:
            raise ValueError(f"worker_id must be 0-{MAX_WORKER_ID}")
        self.datacenter_id = datacenter_id
        self.worker_id     = worker_id
        self.sequence      = 0
        self.last_ts       = -1
        self._lock         = threading.Lock()

    def _current_ms(self) -> int:
        return int(time.time() * 1000)

    def _wait_next_ms(self, last_ts: int) -> int:
        ts = self._current_ms()
        while ts <= last_ts:
            ts = self._current_ms()
        return ts

    def next_id(self) -> int:
        with self._lock:
            ts = self._current_ms()

            if ts < self.last_ts:
                diff = self.last_ts - ts
                if diff > self.MAX_BACKWARD_WAIT_MS:
                    raise RuntimeError(
                        f"Clock moved backward by {diff}ms. "
                        f"Refusing to generate ID. last_ts={self.last_ts} current={ts}"
                    )
                # Small drift: spin until clock catches up
                while ts < self.last_ts:
                    ts = self._current_ms()

            if ts == self.last_ts:
                self.sequence = (self.sequence + 1) & MAX_SEQUENCE
                if self.sequence == 0:
                    # Sequence exhausted this millisecond: wait for next ms
                    ts = self._wait_next_ms(self.last_ts)
            else:
                # New millisecond: reset sequence
                self.sequence = 0

            self.last_ts = ts
            return assemble_id(ts, self.datacenter_id, self.worker_id, self.sequence)
Watch Out

Running Snowflake generators on hosts or containers that can be suspended and resumed, or live-migrated (for example, workloads on pre-emptible infrastructure), is dangerous. The process can be paused mid-flight and resumed where the system clock is slightly behind. Even a 1ms backward jump sends the generator down the drift path, and anything beyond the configured threshold raises an error and halts ID generation on that node. Set the threshold high enough to absorb routine NTP corrections, and add alerting whenever the drift path is hit - it means your infrastructure has a time synchronization problem.
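
One way to make the drift path observable is to separate the decision from the waiting and count every hit. This is a minimal sketch under stated assumptions: classify_clock and ClockAction are hypothetical names, and the Counter stands in for whatever metrics client you actually use.

```python
from collections import Counter
from enum import Enum

class ClockAction(Enum):
    PROCEED = "proceed"   # clock did not move backward
    SPIN = "spin"         # small backward step: wait for the clock to catch up
    REFUSE = "refuse"     # large backward step: raise and alert

drift_events = Counter()  # stand-in for a real metrics client

def classify_clock(now_ms: int, last_ms: int, max_backward_wait_ms: int = 5) -> ClockAction:
    """Decide how the generator should react to the current clock reading."""
    if now_ms >= last_ms:
        return ClockAction.PROCEED
    backward_ms = last_ms - now_ms
    if backward_ms <= max_backward_wait_ms:
        action = ClockAction.SPIN
    else:
        action = ClockAction.REFUSE
    drift_events[action] += 1   # every hit on the drift path is worth counting
    return action
```

Alerting on a non-zero REFUSE count, or on a sustained SPIN rate, surfaces clock problems before they show up as failed writes.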

Machine ID Assignment

The machine ID is what separates Snowflake from a simpler single-node timestamp counter. It is the field that guarantees two nodes in different datacenters, processing requests at exactly the same millisecond with the same sequence counter value, will produce different IDs.

Think of machine IDs like parking spots in a multi-story garage. The garage (ID registry) has 1,024 numbered spots. When a node starts up, it drives in and takes any available spot, putting its name on the ticket. The node remembers its spot number for the duration of the session. When the node shuts down, it vacates the spot. If the node crashes without vacating, the spot stays marked “occupied” for a short timeout (30 seconds with an ephemeral etcd lease) and then becomes available again.

# Machine ID allocation via etcd ephemeral leases
# (MAX_WORKER_ID comes from the bit layout constants above)
import threading
import time

import etcd3

class MachineIdAllocator:
    LEASE_TTL_SECONDS = 30
    MAX_RETRIES = 10

    def __init__(self, etcd_host: str, etcd_port: int, prefix: str = "/snowflake/workers/"):
        self.client  = etcd3.client(host=etcd_host, port=etcd_port)
        self.prefix  = prefix
        self.lease   = None
        self.worker_id = None

    def acquire(self, datacenter_id: int) -> int:
        """Atomically claim an available worker ID in [0..31] for this datacenter."""
        prefix = f"{self.prefix}dc{datacenter_id}/"
        for attempt in range(self.MAX_RETRIES):
            # Scan all taken slots
            taken = set()
            for _, meta in self.client.get_prefix(prefix):
                slot = int(meta.key.decode().replace(prefix, ""))
                taken.add(slot)

            for slot in range(MAX_WORKER_ID + 1):
                if slot in taken:
                    continue
                lease = self.client.lease(self.LEASE_TTL_SECONDS)
                key = f"{prefix}{slot}"
                # Conditional put: only succeed if key does not exist
                success, _ = self.client.transaction(
                    compare=[self.client.transactions.version(key) == 0],
                    success=[self.client.transactions.put(key, "1", lease=lease)],
                    failure=[]
                )
                if success:
                    self.lease = lease
                    self.worker_id = slot
                    self._start_keepalive()
                    return slot
                # Another node won the race for this slot: drop the unused lease
                lease.revoke()
        raise RuntimeError(f"No available worker ID in datacenter {datacenter_id} after {self.MAX_RETRIES} attempts")

    def _start_keepalive(self):
        def keepalive():
            while True:
                lease = self.lease   # read once; release() may clear it concurrently
                if lease is None:
                    break
                lease.refresh()      # Lease.refresh() renews the TTL
                time.sleep(self.LEASE_TTL_SECONDS // 2)
        t = threading.Thread(target=keepalive, daemon=True)
        t.start()

    def release(self):
        if self.lease:
            lease, self.lease = self.lease, None
            lease.revoke()           # deletes the key attached to the lease

The keepalive thread refreshes the lease every 15 seconds. If the node crashes and the keepalive stops, the lease expires within 30 seconds of the last refresh, and the worker slot becomes available. So somewhere between 15 and 30 seconds after a crash, the dead node’s worker ID can be claimed by a new node. Is that a collision risk? Not for a node that is truly dead: its last ID used the clock at the time of death, and the new node starts generating with the current clock, which is at least 15 seconds later - far more than the 200ms of skew we allow between nodes. The timestamp component guarantees they will never produce the same ID even with the same machine ID. The dangerous case is a node that is paused rather than dead, which is covered under Failure Modes and Recovery.
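
The safety argument for slot reuse can be made quantitative. This is a rough sketch, assuming the lease was last refreshed up to one refresh interval before the crash and that clock skew between nodes is bounded as stated in the constraints; the function names are illustrative.

```python
def min_reuse_gap_s(lease_ttl_s: float, refresh_interval_s: float) -> float:
    """Shortest possible time between a crash and the slot becoming claimable.

    Worst case: the node crashes just before its next refresh, so the lease
    has already aged by one full refresh interval.
    """
    return lease_ttl_s - refresh_interval_s

def reuse_is_safe(lease_ttl_s: float, refresh_interval_s: float,
                  max_clock_skew_s: float) -> bool:
    """True if the new owner's clock must already be past the old owner's last timestamp."""
    return min_reuse_gap_s(lease_ttl_s, refresh_interval_s) > max_clock_skew_s

gap = min_reuse_gap_s(30, 15)        # TTL and refresh interval from the text
safe = reuse_is_safe(30, 15, 0.2)    # 200 ms skew bound from the constraints
```

The margin shrinks if the TTL is shortened or the skew bound is loosened, so the two settings are worth reviewing together.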

Key Insight

The clock and machine ID work together as a two-layer uniqueness guarantee: the clock ensures IDs from different milliseconds never collide, and the machine ID ensures IDs from the same millisecond on different nodes never collide. You only need the sequence counter to separate IDs from the same node in the same millisecond.

Sequence Overflow Handling

The 12-bit sequence counter gives 4,096 unique slots per millisecond per node. At typical service load this is never a concern - a heavily loaded microservice might generate a few hundred IDs per millisecond. But at extreme load (batch imports, event floods, or ID pre-generation) the sequence can exhaust.

When sequence reaches 4,095 and increments to 4,096, the bitwise AND with 0xFFF wraps it back to 0. At this point the generator has used all available sequence values for this millisecond. It must wait until the system clock advances to the next millisecond before resetting the sequence and continuing. The _wait_next_ms() function does a busy spin reading the clock repeatedly. This spin is expected to last less than 1ms and is therefore safe.

What should the system do if a service genuinely needs more than 4,096 IDs per millisecond? The correct answer is to run more generator nodes. A service that needs 10,000 IDs/ms should have three or four generator nodes running simultaneously, each with a different machine ID. The IDs from each node will be interleaved but each will be unique. This horizontal scaling pattern means you never need to increase the sequence bit width.

# Demonstrating sequence overflow and wait behavior
# (method of SnowflakeGenerator above; relies on its `import time`)
def _wait_next_ms(self, last_ts: int) -> int:
    # Busy spin - expected to exit within <1ms
    # Safe because sequence overflow is rare and short
    # Safety valve: measure elapsed time rather than counting loop
    # iterations, since the cost of one iteration varies by machine
    deadline = time.monotonic() + 0.010
    ts = self._current_ms()
    while ts <= last_ts:
        if time.monotonic() > deadline:
            raise RuntimeError(
                f"Waited >10ms for clock to advance past {last_ts}. "
                f"Current clock: {ts}. Possible clock freeze."
            )
        ts = self._current_ms()
    return ts
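
The horizontal scaling argument can be checked by enumerating every ID that several nodes could issue in the same millisecond. This sketch uses a flat 10-bit machine ID and a hypothetical pack helper; it demonstrates the uniqueness property and is not a generator.

```python
SEQUENCE_BITS = 12
MACHINE_BITS = 10
IDS_PER_MS_PER_NODE = 1 << SEQUENCE_BITS   # 4096

def pack(ts_offset_ms: int, machine_id: int, sequence: int) -> int:
    """Timestamp in the high bits, then machine ID, then sequence."""
    return ((ts_offset_ms << (MACHINE_BITS + SEQUENCE_BITS))
            | (machine_id << SEQUENCE_BITS)
            | sequence)

def ids_for_one_ms(ts_offset_ms: int, machine_ids: list[int]) -> list[int]:
    """Every ID the given nodes can issue in a single millisecond."""
    return [pack(ts_offset_ms, m, s)
            for m in machine_ids
            for s in range(IDS_PER_MS_PER_NODE)]

# Nodes required for 10,000 IDs/ms, using ceiling division.
nodes_needed = -(-10_000 // IDS_PER_MS_PER_NODE)

# Three nodes, same millisecond, every sequence value used.
batch = ids_for_one_ms(123_456, machine_ids=[0, 1, 2])
```

Three nodes cover 12,288 IDs per millisecond with no overlap, because the machine ID bits differ for every pair of nodes.
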
Real World

Discord uses Snowflake IDs for all messages, channels, servers, and users. Their epoch is the first second of 2015 (2015-01-01T00:00:00Z). A Discord ID of 175928847299117063 decodes to April 30, 2016, internal worker 1, process 0, increment 7. This lets Discord engineers instantly understand when any entity was created and which worker issued its ID just by inspecting the ID - no database lookup required for basic forensics.
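
Decoding such an ID takes only shifts and masks. The sketch below uses the field layout published in Discord's developer documentation (42-bit timestamp, 5-bit internal worker ID, 5-bit internal process ID, 12-bit increment); the function name is illustrative.

```python
from datetime import datetime, timezone

DISCORD_EPOCH_MS = 1420070400000  # 2015-01-01T00:00:00Z

def decode_discord_snowflake(snowflake: int) -> dict:
    """Split a Discord snowflake into its four documented fields."""
    return {
        "timestamp_ms": (snowflake >> 22) + DISCORD_EPOCH_MS,
        "worker_id":    (snowflake & 0x3E0000) >> 17,
        "process_id":   (snowflake & 0x1F000) >> 12,
        "increment":    snowflake & 0xFFF,
    }

fields = decode_discord_snowflake(175928847299117063)
created_at = datetime.fromtimestamp(fields["timestamp_ms"] / 1000, tz=timezone.utc)
```

For this ID the fields are worker 1, process 0, increment 7, and created_at falls on 2016-04-30 UTC.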

Data Model

The ID generator itself has minimal persistent state - almost everything is in-process. But the machine ID registry and the audit log have real schemas.

-- Machine ID registry table (used as fallback if etcd is unavailable)
-- Primary source of truth is etcd; this is a warm standby
CREATE TABLE machine_id_registry (
    id              SERIAL PRIMARY KEY,
    datacenter_id   SMALLINT NOT NULL CHECK (datacenter_id >= 0 AND datacenter_id <= 31),
    worker_id       SMALLINT NOT NULL CHECK (worker_id >= 0 AND worker_id <= 31),
    hostname        VARCHAR(255) NOT NULL,
    ip_address      INET NOT NULL,
    pid             INTEGER NOT NULL,
    allocated_at    TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    last_heartbeat  TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    released_at     TIMESTAMPTZ,
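    status          VARCHAR(16) NOT NULL DEFAULT 'active'
                    CHECK (status IN ('active', 'released', 'expired'))
);

-- Partial unique index: at most one active row per (datacenter_id, worker_id).
-- PostgreSQL does not accept a WHERE clause on an inline UNIQUE constraint,
-- so the rule is expressed as a unique index instead.
CREATE UNIQUE INDEX idx_registry_dc_worker
    ON machine_id_registry (datacenter_id, worker_id)
    WHERE status = 'active';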

CREATE INDEX idx_registry_heartbeat
    ON machine_id_registry (last_heartbeat)
    WHERE status = 'active';

-- ID generation audit log (optional, for debugging only - never on hot path)
CREATE TABLE id_gen_audit (
    id              BIGINT NOT NULL,
    datacenter_id   SMALLINT NOT NULL,
    worker_id       SMALLINT NOT NULL,
    sequence        SMALLINT NOT NULL,
    generated_at    TIMESTAMPTZ NOT NULL,
    caller_service  VARCHAR(64)
) PARTITION BY RANGE (generated_at);

-- Partition by day - only retain for 7 days
CREATE TABLE id_gen_audit_2026_06_04
    PARTITION OF id_gen_audit
    FOR VALUES FROM ('2026-06-04') TO ('2026-06-05');

The machine_id_registry uses a partial unique index - only one active row per (datacenter_id, worker_id) combination is allowed. When a node is released or expires, its status changes to released or expired, which removes it from the unique constraint scope. This allows the same slot to be re-allocated in the future without violating uniqueness.
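
The claim-and-release behavior of a partial unique index can be tried without a PostgreSQL server. The sketch below uses SQLite through Python's built-in sqlite3 module, which also supports partial indexes; the trimmed-down table and the claim and release helpers are assumptions made for illustration, not the production schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE machine_id_registry (
        id            INTEGER PRIMARY KEY,
        datacenter_id INTEGER NOT NULL,
        worker_id     INTEGER NOT NULL,
        status        TEXT NOT NULL DEFAULT 'active'
    )
""")
# Only rows with status = 'active' participate in the uniqueness check.
conn.execute("""
    CREATE UNIQUE INDEX one_active_owner
        ON machine_id_registry (datacenter_id, worker_id)
        WHERE status = 'active'
""")

def claim(dc: int, worker: int) -> bool:
    """Try to claim a slot; False if an active row already holds it."""
    try:
        conn.execute(
            "INSERT INTO machine_id_registry (datacenter_id, worker_id) VALUES (?, ?)",
            (dc, worker),
        )
        return True
    except sqlite3.IntegrityError:
        return False

def release(dc: int, worker: int) -> None:
    """Mark the active row as released, removing it from the index scope."""
    conn.execute(
        "UPDATE machine_id_registry SET status = 'released' "
        "WHERE datacenter_id = ? AND worker_id = ? AND status = 'active'",
        (dc, worker),
    )

first = claim(3, 7)      # slot is free
second = claim(3, 7)     # rejected: an active row exists
release(3, 7)
third = claim(3, 7)      # allowed again: the old row left the index
```

The released row stays in the table as history, while the index only ever sees the current owner.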

Multi-datacenter scaling diagram showing 32 nodes per datacenter, etcd registry, and global capacity math

Key Algorithms and Protocols

The Bit Assembly Algorithm

The core algorithm is a set of bit shifts and OR operations. It looks simple, but the specific choice of bit widths encodes a set of capacity and ordering constraints that are fixed for the lifetime of the system. Changing the bit widths after launch requires migrating all existing IDs.

# Full production-grade Snowflake generator with batch support
import threading
import time
from dataclasses import dataclass

EPOCH_MS        = 1704067200000  # 2024-01-01 00:00:00 UTC
TIMESTAMP_BITS  = 41
DATACENTER_BITS = 5
WORKER_BITS     = 5
SEQUENCE_BITS   = 12
TIMESTAMP_SHIFT = DATACENTER_BITS + WORKER_BITS + SEQUENCE_BITS  # 22
DATACENTER_SHIFT = WORKER_BITS + SEQUENCE_BITS                   # 17
WORKER_SHIFT    = SEQUENCE_BITS                                   # 12
MAX_SEQUENCE    = (1 << SEQUENCE_BITS) - 1                       # 4095
MAX_WORKER_ID   = (1 << WORKER_BITS) - 1                        # 31
MAX_DC_ID       = (1 << DATACENTER_BITS) - 1                     # 31

@dataclass
class SnowflakeId:
    raw: int

    @property
    def timestamp_ms(self) -> int:
        return (self.raw >> TIMESTAMP_SHIFT) + EPOCH_MS

    @property
    def datacenter_id(self) -> int:
        return (self.raw >> DATACENTER_SHIFT) & MAX_DC_ID

    @property
    def worker_id(self) -> int:
        return (self.raw >> WORKER_SHIFT) & MAX_WORKER_ID

    @property
    def sequence(self) -> int:
        return self.raw & MAX_SEQUENCE

class SnowflakeGenerator:
    MAX_CLOCK_BACKWARD_MS = 5

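    def __init__(self, datacenter_id: int, worker_id: int):
        assert 0 <= datacenter_id <= MAX_DC_ID
        assert 0 <= worker_id <= MAX_WORKER_ID
        self.datacenter_id = datacenter_id
        self.worker_id     = worker_id
        self.sequence      = 0
        self.last_ts       = -1
        self._lock         = threading.Lock()
        # Anchor the monotonic clock to the wall clock once at startup.
        # monotonic_ns() has an arbitrary reference point, so its raw value
        # cannot be compared with EPOCH_MS directly.
        self._wall_base_ms = time.time_ns() // 1_000_000
        self._mono_base_ns = time.monotonic_ns()

    def _ms(self) -> int:
        # Wall-clock milliseconds, advanced by the monotonic clock so that
        # NTP steps cannot move the value backward within this process.
        elapsed_ms = (time.monotonic_ns() - self._mono_base_ns) // 1_000_000
        return self._wall_base_ms + elapsed_ms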

    def next_id(self) -> int:
        return self.next_batch(1)[0]

    def next_batch(self, count: int) -> list[int]:
        if count <= 0 or count > MAX_SEQUENCE + 1:
            raise ValueError(f"count must be 1..{MAX_SEQUENCE + 1}")
        results = []
        with self._lock:
            for _ in range(count):
                ts = self._ms()
                if ts < self.last_ts:
                    diff = self.last_ts - ts
                    if diff > self.MAX_CLOCK_BACKWARD_MS:
                        raise RuntimeError(
                            f"Clock backward by {diff}ms (last_ts={self.last_ts}, now={ts})"
                        )
                    while ts < self.last_ts:
                        ts = self._ms()

                if ts == self.last_ts:
                    self.sequence = (self.sequence + 1) & MAX_SEQUENCE
                    if self.sequence == 0:
                        while ts <= self.last_ts:
                            ts = self._ms()
                else:
                    self.sequence = 0

                self.last_ts = ts
                raw_id = (
                    ((ts - EPOCH_MS) << TIMESTAMP_SHIFT) |
                    (self.datacenter_id << DATACENTER_SHIFT) |
                    (self.worker_id << WORKER_SHIFT) |
                    self.sequence
                )
                results.append(raw_id)
        return results

Note the use of time.monotonic_ns() rather than time.time(). A monotonic clock is not subject to NTP backward adjustments - it only moves forward. The tradeoff is that its reference point is arbitrary: the raw reading has no absolute meaning across processes or reboots, so it cannot be subtracted from EPOCH_MS directly, and the ID needs an absolute wall-clock offset. A practical production approach combines the two: read the wall clock once to anchor the timestamp, advance it with the monotonic clock, and keep the last_ts check as a backstop that refuses to go backward.

Key Insight

Driving timestamps from time.monotonic_ns() removes NTP backward adjustments within a running process, but it does not carry across restarts: the monotonic reading is meaningless to a new process, which has to take its first timestamp from the wall clock. If the wall clock was stepped backward while the process was down, the new process can momentarily produce timestamps earlier than ones the previous process already used. Initialize last_ts from the wall clock at startup rather than leaving it unset, and treat a restart that follows a backward clock step like any other large drift event.

The Sequence Overflow Protocol

When sequence overflow is detected (sequence wraps to 0), the generator must not generate an ID with sequence 0 at the same timestamp as the previous ID with sequence 4095. The invariant is: (ts, machine_id, sequence) is a unique triple. Since we cannot reuse (last_ts, machine_id, 0) if (last_ts, machine_id, 0..4095) have already been generated, we wait for ts > last_ts before resetting.

# Sequence overflow: wrong vs. correct handling
# WRONG approach - generates duplicate triple
if self.sequence == 0:
    self.last_ts = ts  # BUG: allows reusing same (ts, machine_id, 0)

# CORRECT approach
if self.sequence == 0:
    # Block until next millisecond. The new millisecond resets sequence safely.
    ts = self._wait_next_ms(self.last_ts)
    # Now ts > last_ts, and sequence=0 is safe to use again
    self.last_ts = ts

Scaling and Performance

Key Insight

Snowflake generators scale by adding nodes, not by sharding state. Since each node is fully independent after startup, you can double your ID generation throughput by deploying twice as many generator processes - no coordination, no rebalancing, no migrations.

Capacity Estimation

Given:
  - 1,024 total nodes (32 DCs x 32 workers per DC)
  - 4,096 IDs per millisecond per node
  - 41-bit timestamp in milliseconds
  - 69.7 year range from 2024-01-01 epoch

Per-node throughput:
  4,096 IDs/ms = 4,096,000 IDs/sec per node

Global throughput:
  4,096,000 * 1,024 nodes = 4,194,304,000 IDs/sec (~4.2 billion/sec)

ID size:
  64 bits = 8 bytes per ID

Storage at 100M IDs/day across all nodes:
  100,000,000 * 8 bytes = 800 MB/day (IDs only, no associated data)

Typical real-world usage:
  - Twitter in 2012: ~3,000 IDs/sec (trivial)
  - Discord: ~10,000 IDs/sec across all entity types
  - Maximum theoretical bottleneck: sequence overflow at 4,096 IDs/ms
    - Requires single node to sustain >4 million inserts/sec
    - Realistic fix: add more generator nodes, never hit in practice
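
The figures above can be reproduced with a few lines of arithmetic. The variable names are illustrative; the inputs are the ones listed under "Given".

```python
NODES = 32 * 32                  # 32 datacenters x 32 workers
IDS_PER_MS = 1 << 12             # 12-bit sequence

per_node_per_sec = IDS_PER_MS * 1000
global_per_sec = per_node_per_sec * NODES

BYTES_PER_ID = 8
ids_per_day = 100_000_000
storage_bytes_per_day = ids_per_day * BYTES_PER_ID

# Average rate implied by 100M IDs/day, for comparison with per-node capacity.
avg_ids_per_sec = ids_per_day / 86_400
```

At 100M IDs per day the average rate is a little over 1,100 IDs per second, several thousand times below what a single node can issue.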

Deployment Modes

There are two primary deployment strategies with different latency profiles.

Embedded library mode: The generator is a library included in each service. Each service instance gets its own machine ID at startup via the registry. ID generation is a pure in-process function call - nanosecond latency, no network. This is the preferred mode for high-throughput services. The downside is that each service instance consumes a machine ID slot, so with 100 services x 10 instances each = 1,000 machine ID slots consumed, approaching the 1,024 limit.

Centralized ID service mode: A dedicated pool of generator nodes (5-10 instances) serves all ID generation requests via gRPC. Services make a network call to get IDs. Latency is 1-3ms (LAN). This conserves machine ID slots (at most 10 slots used regardless of service count) but introduces a network hop and a dependency. For most workloads under 100,000 IDs/sec, a 5-node ID service handles the load comfortably.
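
The slot budget for the two modes can be compared with a short calculation. This is a sketch with hypothetical helper names, using the 10-bit machine ID space and the example fleet sizes from the text.

```python
TOTAL_SLOTS = 1 << 10   # 10-bit machine ID space

def embedded_slots(services: int, instances_per_service: int) -> int:
    """Embedded mode: every service instance holds its own machine ID."""
    return services * instances_per_service

def remaining_slots(used: int) -> int:
    """Slots left for growth; raises if the fleet no longer fits."""
    if used > TOTAL_SLOTS:
        raise ValueError(f"{used} slots needed, only {TOTAL_SLOTS} exist")
    return TOTAL_SLOTS - used

embedded_used = embedded_slots(services=100, instances_per_service=10)
embedded_left = remaining_slots(embedded_used)
centralized_left = remaining_slots(10)   # 10-instance ID service
```

With only 24 slots of headroom in embedded mode, a rolling deployment that briefly doubles the instances of a few services could exhaust the pool.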

# Kubernetes deployment for centralized ID service
apiVersion: apps/v1
kind: Deployment
metadata:
  name: id-generator-service
spec:
  replicas: 5
  selector:
    matchLabels:
      app: id-generator
  template:
    metadata:
      labels:
        app: id-generator   # must match spec.selector.matchLabels
    spec:
      containers:
      - name: id-generator
        image: myorg/snowflake-server:1.4.2
        resources:
          requests:
            cpu: "100m"
            memory: "64Mi"
          limits:
            cpu: "500m"
            memory: "128Mi"
        env:
        - name: ETCD_ENDPOINTS
          value: "etcd-0.etcd:2379,etcd-1.etcd:2379,etcd-2.etcd:2379"
        - name: DATACENTER_ID
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['datacenter-id']
        - name: LEASE_TTL_SECONDS
          value: "30"
        - name: MAX_CLOCK_BACKWARD_MS
          value: "5"
        readinessProbe:
          grpc:
            port: 9090
          initialDelaySeconds: 5
          periodSeconds: 10
Real World

Uber’s UUID (later renamed UberID) system, described in their 2016 engineering blog, uses a variant of Snowflake where each data region gets a fixed region bit prefix. Their lessons learned include: never use UUID v4 for database primary keys (index fragmentation), always verify clock sync health before generating production IDs, and machine ID lease TTL should be long enough that brief network partitions to etcd do not cause ID generation to fail.

Failure Modes and Recovery

| Failure | Detection | Impact | Recovery |
|---|---|---|---|
| Generator node crash | etcd lease expiry (30s TTL) | ID generation stops on that node; callers get errors or fall back to another node | Restart node; it re-acquires machine ID from registry at boot |
| Clock drift backward (small, < 5ms) | ts < last_ts check in next_id() | Brief spin wait of up to 5ms; callers see slightly elevated latency | Automatic - generator spins until clock catches up |
| Clock drift backward (large, > 5ms) | ts < last_ts check with diff threshold | next_id() throws RuntimeError; all ID generation on this node halts | Alert and investigate: NTP misconfiguration, VM migration, clock source change |
| etcd cluster unavailable at startup | gRPC timeout on acquire() call | Node cannot acquire machine ID; startup fails | Node retries with exponential backoff; alert if etcd down for > 2 minutes |
| Machine ID slot exhaustion (all 1,024 slots taken) | acquire() returns error after MAX_RETRIES | New nodes cannot start; existing nodes unaffected | Investigate zombie leases in etcd; force-release stale slots via admin CLI |
| Sequence overflow (transient) | sequence == 0 after increment | Brief sub-millisecond wait for clock advance; throughput capped at 4096 IDs/ms | Automatic - generator blocks until next ms; if persistent, add more nodes |
| Datacenter network partition | etcd lease refresh fails | Node continues generating IDs (machine ID remains in memory) but lease expires after TTL | During partition: safe if no new node claims the same machine ID; after partition heals, stale lease is cleared |
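
The detection logic named in the table (the ts < last_ts check, the 5ms threshold, and the sequence == 0 overflow test) can be sketched as follows. This is a minimal illustration, not the production generator: the epoch value, the class name SnowflakeSketch, and the injectable clock_ms parameter are assumptions added here so that drift can be simulated.

```python
import threading
import time

EPOCH_MS = 1_577_836_800_000        # assumed custom epoch (2020-01-01T00:00:00Z)
MACHINE_BITS = 10
SEQUENCE_BITS = 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1   # 4095
MAX_CLOCK_BACKWARD_MS = 5           # mirrors the env var in the deployment


class SnowflakeSketch:
    def __init__(self, machine_id, clock_ms=None):
        if not 0 <= machine_id < (1 << MACHINE_BITS):
            raise ValueError("machine_id out of range")
        self.machine_id = machine_id
        # Hypothetical hook: lets a caller replace the wall clock.
        self._clock_ms = clock_ms or (lambda: time.time_ns() // 1_000_000)
        self._lock = threading.Lock()
        self._last_ts = -1
        self._sequence = 0

    def next_id(self):
        with self._lock:
            ts = self._clock_ms()
            if ts < self._last_ts:
                drift = self._last_ts - ts
                if drift > MAX_CLOCK_BACKWARD_MS:
                    # Large backward jump: fail fast rather than risk duplicates.
                    raise RuntimeError(f"clock moved backward by {drift}ms")
                # Small backward jump: spin until the clock catches up.
                while ts < self._last_ts:
                    ts = self._clock_ms()
            if ts == self._last_ts:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # Sequence overflow: block until the next millisecond.
                    while ts <= self._last_ts:
                        ts = self._clock_ms()
            else:
                self._sequence = 0
            self._last_ts = ts
            return (
                ((ts - EPOCH_MS) << (MACHINE_BITS + SEQUENCE_BITS))
                | (self.machine_id << SEQUENCE_BITS)
                | self._sequence
            )
```

Holding a lock while spinning keeps the sketch short; a real generator would probably bound the wait and surface a metric each time either branch is taken.
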
Watch Out

The most common operational mistake is deploying generators in environments where the process can be frozen and later resumed, for example VM live migration, hypervisor pauses, or container checkpoint and restore (a risk to check for with Kubernetes on spot instances). A short pause of around 100ms is harmless by itself: the generator simply sees the clock jump forward past its in-memory last_ts. The dangerous case is a pause that outlasts the etcd lease TTL. The lease expires, the same machine ID is reassigned to another node during the suspension window, and when the frozen process resumes you have two nodes with the same machine ID generating IDs with overlapping timestamps. Pod disruption budgets and PodAntiAffinity reduce voluntary evictions and co-location, but they cannot stop a node-level freeze, so also configure the etcd lease TTL to be longer than your worst-case container suspension time.

Comparison of Approaches

| Approach | Latency | Coordination | Failure Mode | Best Fit |
|---|---|---|---|---|
| DB auto-increment | 1-10ms (network) | Strong, per ID | Single point of failure; bottleneck at >10K/sec | Small monolith, low write volume |
| UUID v4 (random) | Nanoseconds (in-process) | None | Index fragmentation at high write rates; no time-sort | Any scale where ordering/size do not matter |
| UUID v7 (time-ordered) | Nanoseconds (in-process) | None | 128-bit size; per-ms resolution only; no machine identity | Modern systems that need ordering but not compactness |
| Snowflake (this design) | Nanoseconds (in-process) | Once at startup (machine ID) | Clock drift; machine ID exhaustion at >1,024 nodes | High-throughput services needing 64-bit time-sortable IDs |
| ULID | Nanoseconds (in-process) | None | Larger than Snowflake (128-bit); lexicographic not numeric | Systems using string primary keys; no integer constraint |
| Mongo ObjectID | Nanoseconds (in-process) | None | 96-bit; second precision (not ms); process-ID based | MongoDB-native workloads only |

The choice between embedded Snowflake and UUID v7 comes down to one question: do you need 64-bit integers or are 128-bit identifiers acceptable? UUID v7 was standardized in RFC 9562 (2024) and offers time-ordered UUIDs with millisecond precision and 74 bits of random data - excellent collision resistance with no coordination. If you are on a new system and your database supports UUID natively (PostgreSQL does, with good B-tree performance), UUID v7 is the simpler choice. If you need a 64-bit integer for storage efficiency, foreign key compatibility with legacy schemas, or to encode metadata (datacenter, node) in the ID itself, Snowflake is the right pick.
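
To make the UUID v7 side of that trade-off concrete, here is a minimal sketch that assembles a UUID v7 by hand from the RFC 9562 field layout: a 48-bit Unix millisecond timestamp, a 4-bit version, 12 random bits, a 2-bit variant, and 62 more random bits. The function name uuid7_sketch and its two optional parameters are assumptions made for illustration and testing; in practice you would normally reach for a maintained library instead.

```python
import os
import time
import uuid


def uuid7_sketch(unix_ms=None, random_bytes=None):
    """Assemble a UUID v7 from its RFC 9562 fields (illustrative only)."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    if random_bytes is None:
        random_bytes = os.urandom(10)          # 80 bits drawn, 74 are used
    rand = int.from_bytes(random_bytes, "big")
    rand_a = (rand >> 62) & 0xFFF              # 12 random bits
    rand_b = rand & ((1 << 62) - 1)            # 62 random bits

    value = (unix_ms & ((1 << 48) - 1)) << 80  # bits 127..80: timestamp
    value |= 0x7 << 76                         # bits 79..76: version 7
    value |= rand_a << 64                      # bits 75..64: rand_a
    value |= 0b10 << 62                        # bits 63..62: variant
    value |= rand_b                            # bits 61..0: rand_b
    return uuid.UUID(int=value)


def uuid7_timestamp_ms(u):
    """Recover the embedded Unix millisecond timestamp."""
    return u.int >> 80
```

Because the timestamp occupies the most significant bits, two values created in different milliseconds compare in creation order, which is what keeps B-tree inserts mostly append-only.
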

Key Takeaways

  • Snowflake bit layout encodes three concerns in 64 bits: when the ID was created (timestamp), where (machine ID), and its position in a burst (sequence) - each field independently decodable.
  • Coordination-free generation is achieved by front-loading coordination to node startup: the one-time machine ID assignment via etcd means every ID generation is a pure arithmetic operation with zero network calls.
  • Monotonicity guarantee comes from tracking last_ts and refusing to generate IDs with a smaller timestamp - the sequence counter handles same-millisecond ordering within a node.
  • Clock drift handling splits into two cases: small backward adjustments (spin-wait) and large adjustments (fail-fast), because silently accepting large clock jumps would violate uniqueness guarantees.
  • Sequence overflow is handled by blocking until the next millisecond, not by increasing sequence bits - the correct scaling response is to add more nodes, each with a distinct machine ID.
  • Machine ID exhaustion is the real capacity limit, not throughput - 1,024 machine IDs is sufficient for most systems, but sidecar-based or per-replica embedded generators can exhaust slots in large Kubernetes clusters.
  • UUID v7 is the modern alternative for new systems that accept 128-bit IDs - it provides time-ordering with no coordination and no machine ID management overhead.
  • etcd TTL tuning is critical in containerized deployments - TTL must be longer than your worst-case pod suspension window to prevent machine ID reuse while a paused pod is still technically alive.

The counter-intuitive lesson from this design is that guaranteeing uniqueness across thousands of nodes does not require those nodes to talk to each other. The insight is that the time dimension and the machine identity dimension are orthogonal uniqueness axes: if two IDs have different timestamps they are unique regardless of machine ID, and if two IDs have different machine IDs they are unique regardless of timestamp. The sequence counter is the narrow-but-adequate solution for the only remaining collision space: same machine, same millisecond. This decomposition of the uniqueness problem into independent dimensions is what makes the algorithm elegant and its performance guarantees tight.

Frequently Asked Questions

Q: Why 41 bits for timestamp instead of 48 bits to get more years?

A: The field widths are a zero-sum game - every extra bit for timestamp means one fewer bit elsewhere. With 41 bits you get 69 years from your epoch. With 48 bits you would get over 8,000 years, which is unnecessary, and you would have to shrink the machine ID or sequence field. Reducing sequence from 12 to 5 bits would cut per-node throughput from 4,096 IDs/ms to 32 IDs/ms - a 128x reduction. Reducing machine ID from 10 to 3 bits would limit you to 8 total nodes globally. The 41/10/12 split is a carefully tuned balance for Twitter-scale systems circa 2010, and most organizations running for under 70 years have no reason to change it.
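
The arithmetic behind those numbers is easy to check. The sketch below assumes the usual convention of 63 usable bits (the sign bit is left at 0) and a 365.25-day year; layout_capacity is a hypothetical helper name.

```python
MS_PER_YEAR = 1000 * 60 * 60 * 24 * 365.25


def layout_capacity(timestamp_bits, machine_bits, sequence_bits):
    """Return lifespan, node count and per-node throughput for a bit layout."""
    if timestamp_bits + machine_bits + sequence_bits != 63:
        raise ValueError("fields must total 63 bits (sign bit stays 0)")
    return {
        "years": (1 << timestamp_bits) / MS_PER_YEAR,
        "nodes": 1 << machine_bits,
        "ids_per_ms": 1 << sequence_bits,
    }


if __name__ == "__main__":
    # The standard split, then the two ways of paying for a 48-bit timestamp.
    for layout in [(41, 10, 12), (48, 10, 5), (48, 3, 12)]:
        print(layout, layout_capacity(*layout))
```
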

Q: Why not use Redis INCR as a distributed counter instead of this complexity?

A: Redis INCR is a coordination approach - every ID requires a network round-trip to a Redis primary. Under load this becomes a bottleneck, and during a Redis failover (typically 10-30 seconds for Sentinel-based failover) ID generation stops entirely. The Snowflake approach’s machine ID + timestamp means the generator continues working even if the registry is completely unreachable, as long as the node already has its machine ID. The only Redis-like approach that avoids the bottleneck is pre-allocating ID ranges (take a block of 10,000 IDs from Redis, serve them locally), which is complex and still has a single-point-of-failure problem for the Redis cluster itself.
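
For comparison, the range pre-allocation idea looks roughly like this. The sketch does not talk to Redis: InMemoryCounter is a stand-in that mimics the semantics of INCRBY (add an amount, return the new value), and BlockAllocator and its reserve_block parameter are names invented for illustration.

```python
import threading


class InMemoryCounter:
    """Stand-in for the shared counter; mimics INCRBY semantics."""

    def __init__(self):
        self.value = 0
        self.calls = 0
        self._lock = threading.Lock()

    def incrby(self, amount):
        with self._lock:
            self.value += amount
            self.calls += 1
            return self.value


class BlockAllocator:
    """Serve IDs locally from a block reserved in one round-trip."""

    def __init__(self, reserve_block, block_size=10_000):
        # reserve_block(block_size) must return the LAST ID of the new block.
        self._reserve_block = reserve_block
        self._block_size = block_size
        self._lock = threading.Lock()
        self._next = 1
        self._end = 0            # empty block forces a reservation on first use

    def next_id(self):
        with self._lock:
            if self._next > self._end:
                self._end = self._reserve_block(self._block_size)
                self._next = self._end - self._block_size + 1
            value = self._next
            self._next += 1
            return value
```

The IDs are unique but no longer ordered by creation time across nodes, and unused IDs in a block are lost when a node restarts, which is part of the complexity the answer refers to.
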

Q: What happens if two nodes accidentally get the same machine ID?

A: This is the most serious failure mode because it produces silent ID collisions. The etcd transaction mechanism (conditional put) prevents two nodes from claiming the same slot simultaneously, and because etcd commits writes only through a Raft quorum, a minority partition cannot hand out the same slot a second time. The realistic path to a duplicate is a node that loses contact with etcd, or is paused, for longer than its lease TTL and keeps generating with the machine ID it holds in memory while the registry reassigns that slot to a new node. Mitigation: run etcd as an odd-sized cluster (3 or 5 members) so quorum is always well defined, alert when lease refreshes have been failing for a significant fraction of the TTL, and decide explicitly whether a node that cannot renew its lease should stop issuing IDs.

Q: Can we use Snowflake IDs as cursor-based pagination tokens?

A: Yes, and this is one of the most useful properties of Snowflake IDs. Because they are time-sortable, you can implement WHERE id > :last_seen_id ORDER BY id ASC LIMIT 100 and get the correct next page without storing a separate created_at index. This is how Twitter’s timeline API, Discord’s message history API, and most large-scale feeds work. The cursor is just the last ID seen, decoded as a 64-bit integer.
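
A minimal sketch of that query pattern, using an in-memory SQLite database so it runs anywhere. The messages table, its columns, and the helper names are assumptions for illustration; the IDs are built with a Snowflake-style shift so that larger timestamps produce larger integers.

```python
import sqlite3


def make_demo_db(count):
    """Create an in-memory table with `count` rows keyed by Snowflake-like IDs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT)")
    rows = [(((1_000 + i) << 22) | (1 << 12), f"msg-{i}") for i in range(count)]
    conn.executemany("INSERT INTO messages (id, body) VALUES (?, ?)", rows)
    return conn


def fetch_page(conn, last_seen_id=0, limit=100):
    """Return one page of rows plus the cursor for the next page."""
    rows = conn.execute(
        "SELECT id, body FROM messages WHERE id > ? ORDER BY id ASC LIMIT ?",
        (last_seen_id, limit),
    ).fetchall()
    # The cursor is simply the last ID on the page; None means no more rows.
    next_cursor = rows[-1][0] if rows else None
    return rows, next_cursor
```

One caveat worth stating in an interview: ordering is strict within a node but only approximately chronological across nodes, because clocks differ slightly and the machine ID bits sit above the sequence bits.
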

Q: What does UUID v7 offer that Snowflake does not?

A: UUID v7 requires no coordination whatsoever (no machine ID registry), is standardized in RFC 9562 so libraries exist for most languages, is 128-bit with 74 bits of randomness (a negligible collision probability without coordination), and fits PostgreSQL's native uuid column type; PostgreSQL 18 also adds a built-in uuidv7() function. The cost is 128-bit vs 64-bit size, which means double the storage for primary keys and foreign keys. For new greenfield systems on modern databases, UUID v7 is often the better choice. Snowflake wins when you need 64-bit integers, are on a legacy schema that expects BIGINT PKs, or want the machine ID embedded in the ID for operational debuggability.

Q: How does Instagram’s approach differ from Twitter’s Snowflake?

A: Instagram (2012) generates IDs inside Postgres PL/pgSQL functions on each database shard, using the shard’s own sequence counter and the shard ID embedded in the ID. The epoch timestamp is at millisecond precision (41 bits), the shard ID uses 13 bits (up to 8,192 shards), and the sequence uses 10 bits (1,024 IDs/ms per shard). This eliminates the need for a separate ID service entirely - the database that stores the data also generates its primary keys. The downside is coupling between schema and ID generation, and it does not work across multiple database vendors.
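
The 41/13/10 layout described above can be expressed in a few lines. This is a Python illustration of the bit packing only, not Instagram's PL/pgSQL function; the function names are hypothetical and the caller is assumed to supply milliseconds since whatever custom epoch is in use.

```python
TIMESTAMP_BITS = 41
SHARD_BITS = 13
SEQUENCE_BITS = 10


def make_shard_id(ms_since_epoch, shard_id, sequence):
    """Pack timestamp, shard ID and per-shard sequence into one integer."""
    if not 0 <= ms_since_epoch < (1 << TIMESTAMP_BITS):
        raise ValueError("timestamp does not fit in 41 bits")
    if not 0 <= shard_id < (1 << SHARD_BITS):
        raise ValueError("shard_id does not fit in 13 bits")
    seq = sequence % (1 << SEQUENCE_BITS)      # wrap the shard's counter
    return (
        (ms_since_epoch << (SHARD_BITS + SEQUENCE_BITS))
        | (shard_id << SEQUENCE_BITS)
        | seq
    )


def split_shard_id(value):
    """Return (ms_since_epoch, shard_id, sequence) for a packed ID."""
    sequence = value & ((1 << SEQUENCE_BITS) - 1)
    shard_id = (value >> SEQUENCE_BITS) & ((1 << SHARD_BITS) - 1)
    ms_since_epoch = value >> (SHARD_BITS + SEQUENCE_BITS)
    return ms_since_epoch, shard_id, sequence
```

Being able to recover the shard ID from any primary key is what lets the application route a lookup to the right shard without a separate mapping table.
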

Interview Questions

Q: Design a distributed ID generator that produces 64-bit time-sortable integers across 1,000 nodes with zero coordination overhead per ID.

Expected depth: Describe the Snowflake bit layout (41 timestamp + 10 machine ID + 12 sequence). Explain why you need a fixed custom epoch rather than Unix epoch. Discuss the machine ID assignment problem - etcd ephemeral leases vs. hash of hostname vs. static config. Walk through the clock drift failure mode and explain both the spin-wait and fail-fast handling. Mention sequence overflow and how to scale throughput by adding nodes.

Q: Your ID generator starts returning errors in production. You see logs saying “clock moved backward by 200ms”. What happened and what do you do?

Expected depth: Diagnose root causes: NTP resync after prolonged drift, VM live migration to a host with a different clock, container resume after suspension, or a rogue date -s command. Immediate action: check NTP sync status (chronyc tracking), check for recent pod migrations in Kubernetes events, check if multiple generator instances share a machine ID. Long-term fix: increase MAX_CLOCK_BACKWARD_MS threshold, add alerting on drift events, use Chrony with makestep disabled for large adjustments, and consider switching from wall-clock to a monotonic clock offset.

Q: How would you modify Snowflake to support 100,000 generator nodes instead of 1,024?

Expected depth: The 10-bit machine ID field limits you to 1,024 nodes. To support 100,000 nodes you need at minimum 17 bits (2^17 = 131,072). You have two options: expand the machine ID field by shrinking timestamp (reduces lifespan) or sequence (reduces throughput per node). A 17-bit machine ID with 36-bit timestamp (2 years) and 10-bit sequence (1,024 IDs/ms) would work for a system with a short lifespan. Alternatively, use 128-bit IDs (UUID v7 territory) and get both time-sorting and unlimited nodes. The right answer is to challenge the premise: 100,000 generator nodes means 100,000 services each embedding a library generator - at that point you should be using UUID v7 instead of fighting Snowflake’s 10-bit limit.

Q: How do you handle the case where the etcd cluster is unavailable at startup for your ID generator?

Expected depth: On startup, the generator needs its machine ID before it can generate any IDs. If etcd is down, you have several options: (1) retry with exponential backoff and fail the service start if etcd is unavailable for more than N seconds - safest, prevents zombie generators; (2) use a pre-configured static machine ID from environment variables or a config file as fallback - works for static deployments, dangerous in dynamic environments where the same static ID might be used by multiple instances; (3) derive a machine ID from a hash of the hostname - deterministic but could collide. The production answer for most teams is option 1 with a 60-second retry window and a clear alert.
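
Option 1 can be sketched as a bounded retry loop. Everything here is an assumption made for illustration: acquire is any callable that returns a machine ID or raises the hypothetical RegistryUnavailable exception, and sleep and clock are injectable so the loop can be exercised without waiting.

```python
import random
import time


class RegistryUnavailable(Exception):
    """Hypothetical error raised when the registry cannot be reached."""


def acquire_with_backoff(acquire, max_wait_s=60.0, base_delay_s=0.5,
                         max_delay_s=8.0, sleep=time.sleep,
                         clock=time.monotonic):
    """Retry acquire() with jittered exponential backoff, then fail startup."""
    deadline = clock() + max_wait_s
    delay = base_delay_s
    attempt = 0
    while True:
        attempt += 1
        try:
            return acquire()
        except RegistryUnavailable as exc:
            # Jitter keeps a fleet of restarting nodes from retrying in lockstep.
            sleep_for = delay * random.uniform(0.5, 1.0)
            if clock() + sleep_for > deadline:
                raise RuntimeError(
                    f"no machine ID after {attempt} attempts"
                ) from exc
            sleep(sleep_for)
            delay = min(delay * 2, max_delay_s)
```

Failing startup loudly is the point of this option: a process that never obtained a machine ID must not be allowed to serve traffic.
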

Q: A service uses Snowflake IDs as Kafka message keys for partitioning. Engineers notice that all messages from the same second go to the same partition, creating hot spots. How do you fix this?

Expected depth: Start by questioning the mechanism. Kafka's default partitioner hashes the serialized key (murmur2), which spreads even sequential keys well, so this symptom usually points to a custom partitioner that computes key % num_partitions. The low 12 bits of a Snowflake ID are the sequence, which resets to 0 every millisecond, so under light load most IDs end in sequence 0 and key % num_partitions lands on the same partition whenever num_partitions divides 4,096. The fix is to not use the raw Snowflake ID for partition assignment. Instead, use a hash of the entity identifier (user ID, order ID) as the partition key, which spreads load and ensures related messages go to the same partition regardless of creation time. The Snowflake ID can still travel in the message for ordering within a partition, but the actual partition assignment uses a different attribute.
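
The effect of a modulo partitioner on Snowflake IDs is easy to reproduce. The sketch below assumes the 41/10/12 layout with the sequence in the low 12 bits and a sequence that starts at 0 each millisecond; zlib.crc32 is only a stand-in for a real hash such as the murmur2 used by Kafka clients, and all function names are hypothetical.

```python
import zlib


def snowflake(ms, machine_id, sequence=0):
    """Pack a Snowflake-style ID (timestamp, machine ID, sequence)."""
    return (ms << 22) | (machine_id << 12) | sequence


def partition_by_modulo(key, num_partitions):
    """What a naive custom partitioner does with an integer key."""
    return key % num_partitions


def partition_by_entity_hash(entity_id, num_partitions):
    """Partition on a hash of the entity identifier instead of the raw ID."""
    return zlib.crc32(entity_id.encode("utf-8")) % num_partitions


def partitions_hit(num_partitions=16, messages=1000):
    """Compare how many partitions each strategy touches under light load."""
    ids = [snowflake(ms, machine_id=7) for ms in range(messages)]
    by_modulo = {partition_by_modulo(i, num_partitions) for i in ids}
    by_hash = {
        partition_by_entity_hash(f"user-{n}", num_partitions)
        for n in range(messages)
    }
    return by_modulo, by_hash
```

With 16 partitions, every ID whose sequence is 0 is a multiple of 4,096 and therefore a multiple of 16, so the modulo strategy sends all of them to partition 0 while the entity hash spreads across partitions.
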
