Build Google Docs Real-Time Collaborative Editing

distributed-systems reliability scalability

System Design Deep Dive

Real-Time Collaborative Editing

The fundamental tension: every user must see their own keystrokes applied instantly, yet the document must converge to the same state across millions of concurrent editors.

⏱ 14 min read📐 Advanced🏗️ Architecture

Picture two colleagues writing on the same physical whiteboard from opposite ends of the room. They can both write simultaneously without bumping into each other, as long as they stay in separate areas. But the moment they both try to edit the same sentence - one crossing out a word while the other rewrites it - you need a referee, a way to reconcile two independent histories into one coherent result. Google Docs does this for millions of simultaneous writers across the internet, with a 90ms round-trip constraint and no referee standing by to adjudicate disputes.

The scale makes the challenge concrete: at peak, Google Docs serves tens of millions of concurrent document sessions. A popular shared document might have dozens of people editing simultaneously, each on different network connections with different latencies. Every keystroke a user types must feel instant on their own screen - that means applying it locally without waiting for server acknowledgment. Yet that same keystroke must somehow be reconciled with edits from every other user and result in a document that looks identical on all screens within a few hundred milliseconds.

Naive approaches collapse immediately under this constraint. Last-write-wins destroys work: if Alice types “Hello” and Bob simultaneously types “World” in the same position, one of them loses their edit entirely. Pessimistic locking prevents the problem but also prevents concurrency - only one user can type at a time, which is useless for a real-time collaboration tool. Simple merging strategies like operational queuing break when network partitions occur or when a user goes offline and makes edits that need to be reconciled later.

The forces in tension are: latency vs consistency (applying locally is fast but potentially wrong), offline support vs convergence (users need to edit without connectivity, but their changes must merge correctly later), and simplicity vs correctness (simpler algorithms break on edge cases that real users hit constantly). We need to solve three core problems: conflict-free concurrent editing (two edits to the same location must both survive and be applied in a way that preserves both users’ intent), cursor position stability (when Alice inserts text before Bob’s cursor, Bob’s cursor must jump forward automatically), and offline convergence (edits made without network access must reconcile correctly when connectivity returns).

Requirements and Constraints

Functional Requirements

Multiple users can edit the same document simultaneously with their changes visible to all others in real time
Each user’s cursor position and selection range is broadcast to all other active collaborators
Users can edit offline and have their changes merged automatically when they reconnect
Full document version history with per-operation attribution and the ability to restore any prior state
Presence indicators showing which users are currently active in the document and their cursor positions
Document is always in a readable, consistent state - no partial operations or half-applied edits visible to any user

Non-Functional Requirements

Operation propagation latency under 100ms for 99th percentile from client submission to all connected peers seeing it
99.99% availability - editing must degrade gracefully, never block on a single server failure
Support for millions of concurrently active documents, each potentially with dozens of simultaneous editors
Strong eventual consistency - all clients that receive the same set of operations must converge to identical document state, regardless of operation arrival order
Cursor updates propagated within 50ms to minimize cursor teleportation on peers’ screens

Constraints

No pessimistic locking - users must never be blocked from typing by another user’s activity
Operations must commute - applying operation A then B must yield the same result as B then A after transformation
The convergence guarantee is the hard invariant - the system may sacrifice ordering guarantees but never convergence

High-Level Architecture

The system is composed of seven major components. The Client SDK runs in the browser, applying operations optimistically to the local document state and queuing them for transmission. The WebSocket Gateway manages persistent connections and routes messages to the correct document session. The Operation Service sequences incoming operations and applies the OT or CRDT transformation logic. The OT/CRDT Engine is the mathematical core that transforms concurrent operations so they can be safely applied. The Document Store persists the canonical document state, operation log, and version snapshots. The Presence Service tracks which users are active and their cursor positions. The Cursor Broadcast Service fans out cursor updates to all peers with minimal latency.

Google Docs real-time collaboration architecture

When a user types a character, the Client SDK immediately applies it to the local document (optimistic application) and emits an operation to the WebSocket Gateway over the persistent connection. The gateway routes the operation to the Operation Service responsible for that document’s session. The Operation Service assigns a sequence number, transforms the operation against any concurrent operations that arrived first, and persists it to the Document Store. The transformed operation is then broadcast to all other connected clients via the WebSocket Gateway. Each receiving client applies the operation to their own document state, using the same transformation logic to ensure convergence.

Key Architectural Decision

The operation service is stateful per document - a single service instance owns a document session for the duration it is active. This eliminates distributed coordination overhead on the critical path. Operations for the same document are processed sequentially by one instance, making transformation straightforward. The statefulness is tolerated because document sessions are ephemeral and can be rebuilt from the operation log on failover.

Component Deep Dives

The Operation Engine

The Operation Engine is the heart of the system - the component that makes concurrent edits safe. Think of it like a court reporter at a deposition where multiple witnesses are speaking at once: the reporter must transcribe every word accurately, even when two people speak simultaneously, and produce a single coherent transcript that faithfully represents what everyone said.

There are two dominant approaches to conflict-free collaborative editing: Operational Transformation (OT) and Conflict-free Replicated Data Types (CRDTs). Google Docs originally used OT, while Figma and Notion have moved toward CRDT-based approaches.

Operational Transformation works by transforming an operation’s parameters against other concurrent operations before applying it. If Alice inserts “X” at position 5 and Bob simultaneously inserts “Y” at position 5, the server transforms Bob’s operation to position 6 (after Alice’s insertion) before broadcasting it. The transform function is the critical primitive:

# Operational Transformation - core transform function
# Operations are: Insert(pos, char) or Delete(pos)

from dataclasses import dataclass
from typing import Union

@dataclass
class Insert:
    pos: int
    char: str
    client_id: str
    seq: int  # client-local sequence number

@dataclass
class Delete:
    pos: int
    client_id: str
    seq: int

Op = Union[Insert, Delete]

def transform(op: Op, against: Op) -> Op:
    """
    Transform `op` as if `against` has already been applied.
    Returns a new Op with adjusted position.
    OT property: apply(apply(doc, against), transform(op, against))
                 == apply(apply(doc, op), transform(against, op))
    """
    if isinstance(op, Insert) and isinstance(against, Insert):
        if against.pos < op.pos:
            # against inserts before op's position - shift right
            return Insert(op.pos + 1, op.char, op.client_id, op.seq)
        elif against.pos == op.pos:
            # tie-break by client_id to ensure deterministic ordering
            if against.client_id < op.client_id:
                return Insert(op.pos + 1, op.char, op.client_id, op.seq)
        return op

    if isinstance(op, Insert) and isinstance(against, Delete):
        if against.pos < op.pos:
            # a deletion before us shifts our position left
            return Insert(op.pos - 1, op.char, op.client_id, op.seq)
        return op

    if isinstance(op, Delete) and isinstance(against, Insert):
        if against.pos <= op.pos:
            # an insertion at or before us shifts our position right
            return Delete(op.pos + 1, op.client_id, op.seq)
        return op

    if isinstance(op, Delete) and isinstance(against, Delete):
        if against.pos < op.pos:
            return Delete(op.pos - 1, op.client_id, op.seq)
        elif against.pos == op.pos:
            # both deleted the same character - op becomes a no-op
            return None  # idempotent delete
        return op

    return op


def apply(doc: list, op: Op) -> list:
    """Apply an operation to a document represented as a character list."""
    if op is None:
        return doc
    doc = doc[:]
    if isinstance(op, Insert):
        doc.insert(op.pos, op.char)
    elif isinstance(op, Delete):
        if op.pos < len(doc):
            doc.pop(op.pos)
    return doc

CRDT-based approaches take a different path: instead of transforming operations, they design the data structure so that concurrent operations are always commutative and idempotent by construction. A common approach is a Sequence CRDT where every character has a globally unique ID and tombstones are used for deletions. No coordination is required because the merge is defined mathematically:

// CRDT Character - each char has a unique globally-ordered ID
type CharID struct {
    ClientID  string
    Timestamp int64
    Seq       int
}

type Char struct {
    ID       CharID
    Value    rune
    Deleted  bool
    After    CharID  // ID of the character this one follows
}

type CRDTDoc struct {
    Chars map[CharID]*Char
    Root  CharID  // sentinel head
}

// Merge two CRDT states - commutative and idempotent
func (d *CRDTDoc) Merge(remote *CRDTDoc) {
    for id, remoteChar := range remote.Chars {
        if local, exists := d.Chars[id]; exists {
            // Tombstone wins - deletion is permanent
            if remoteChar.Deleted {
                local.Deleted = true
            }
        } else {
            // New character from remote - add it
            c := *remoteChar
            d.Chars[id] = &c
        }
    }
}

// Linearize the CRDT into a readable string
func (d *CRDTDoc) ToText() string {
    // Build adjacency: after -> []chars sorted by ID
    children := map[CharID][]*Char{}
    for _, c := range d.Chars {
        children[c.After] = append(children[c.After], c)
    }
    // Sort children by (Timestamp, ClientID) for determinism
    var result []rune
    var walk func(id CharID)
    walk = func(id CharID) {
        cs := children[id]
        sortChars(cs)  // sort by CharID for deterministic order
        for _, c := range cs {
            if !c.Deleted {
                result = append(result, c.Value)
            }
            walk(c.ID)
        }
    }
    walk(d.Root)
    return string(result)
}

The WebSocket Gateway

The WebSocket Gateway is the nervous system of the collaboration architecture. It manages the persistent long-lived connections between clients and the server - think of it as the switchboard in a large telephone exchange, where thousands of callers need to be connected to the right conference room instantly.

Each WebSocket Gateway instance manages thousands of concurrent connections. The gateway uses document_id-based routing - all connections for the same document are affinity-routed to the same Operation Service instance. This is achieved with consistent hashing on document_id at the load balancer level. When a client sends an operation, the gateway parses the document_id from the message header and routes it to the correct downstream instance with minimal overhead.

Connection management handles the messy reality of mobile and unreliable networks. Clients implement exponential backoff reconnection with a maximum of 30 seconds between attempts. On reconnect, the client sends its last acknowledged operation sequence number, and the server replays any missed operations as a compressed delta. The gateway maintains a short in-memory buffer (30 seconds of operations) per document to serve fast replays without hitting the Document Store for minor network blips.

Google Docs operation data flow through OT engine

The Document Store

The Document Store has a dual responsibility: it must serve fast reads for clients loading a document, and it must provide an immutable, replayable operation log for version history and conflict resolution. Think of it like a bank ledger - the current balance is computed from the transaction history, but you also keep snapshots (monthly statements) to avoid replaying ten years of transactions every time someone checks their balance.

We use a snapshot-plus-delta storage model. The canonical document state is snapshotted every 500 operations or every 10 minutes, whichever comes first. Between snapshots, individual operations are stored as deltas in an append-only operations table. Loading a document means fetching the latest snapshot and replaying the operations that followed it - a bounded, fast operation.

-- Document storage schema
CREATE TABLE documents (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    title       TEXT NOT NULL DEFAULT 'Untitled Document',
    owner_id    UUID NOT NULL REFERENCES users(id),
    created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at  TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    is_deleted  BOOLEAN NOT NULL DEFAULT FALSE
);

-- Immutable operation log - append-only, never updated
CREATE TABLE operations (
    id              BIGSERIAL,
    document_id     UUID NOT NULL REFERENCES documents(id),
    seq             BIGINT NOT NULL,       -- global sequence within document
    client_seq      INT NOT NULL,          -- client-local sequence (for ACK)
    client_id       UUID NOT NULL,         -- which client submitted this
    op_type         TEXT NOT NULL,         -- 'insert' | 'delete' | 'retain'
    op_data         JSONB NOT NULL,        -- {pos, char} or {pos, len}
    applied_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    PRIMARY KEY (document_id, seq)
) PARTITION BY HASH (document_id);

-- Periodic snapshots to bound replay cost
CREATE TABLE document_versions (
    id              BIGSERIAL PRIMARY KEY,
    document_id     UUID NOT NULL REFERENCES documents(id),
    seq_at          BIGINT NOT NULL,       -- snapshot taken after this seq
    content         TEXT NOT NULL,         -- full document text at this seq
    char_count      INT NOT NULL,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE (document_id, seq_at)
);

CREATE INDEX idx_ops_doc_seq ON operations (document_id, seq);
CREATE INDEX idx_versions_doc ON document_versions (document_id, seq_at DESC);

The operations table is hash-partitioned by document_id. This distributes write load across shards while keeping all operations for a single document on the same partition - critical for efficient replay. We use a composite primary key of (document_id, seq) so that sequence-ordered scans are physically local.

The Presence and Cursor Service

The Presence and Cursor Service solves a different problem than the Operation Engine: it must broadcast ephemeral, high-frequency positional data to all collaborators with minimal latency. Unlike document operations - which must be durable and ordered - cursor positions are stateless and can be dropped. If a cursor update is lost, the next update (arriving 50ms later) will supersede it.

Cursor positions are maintained in Redis using per-document hash maps keyed by client_id. Each client sends a cursor update on every selection change, throttled to 30 updates per second. The Cursor Broadcast Service subscribes to Redis Pub/Sub channels (one per document) and fans out updates to all connected WebSocket clients. Because cursor data is ephemeral, it never touches the Document Store.

OT internals and presence service component diagram

When a document operation is applied, the Presence Service also transforms all active cursor positions through the same OT transform logic. If Alice inserts a character at position 10, every cursor at position 10 or higher must be shifted right. This transformation happens synchronously with operation application so that broadcasted cursor positions are always valid relative to the current document state.

Data Model

The data model maps directly to the three core concerns: documents (the thing being edited), operations (the immutable record of every edit), and versions (the periodic snapshots for fast loading).

-- Partitioned operations table for scale
-- Each shard holds ~10-20% of all documents
CREATE TABLE operations_p0 PARTITION OF operations
    FOR VALUES WITH (modulus 8, remainder 0);
CREATE TABLE operations_p1 PARTITION OF operations
    FOR VALUES WITH (modulus 8, remainder 1);
-- ... p2 through p7

-- Presence tracking (Redis-backed, but mirrored here for audit)
CREATE TABLE presence (
    document_id     UUID NOT NULL REFERENCES documents(id),
    client_id       UUID NOT NULL,
    user_id         UUID REFERENCES users(id),
    cursor_pos      INT,
    selection_start INT,
    selection_end   INT,
    color           TEXT,           -- assigned color for cursor display
    last_seen       TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    PRIMARY KEY (document_id, client_id)
);

-- Document sharing and permissions
CREATE TABLE document_access (
    document_id     UUID NOT NULL REFERENCES documents(id),
    user_id         UUID NOT NULL REFERENCES users(id),
    role            TEXT NOT NULL CHECK (role IN ('owner', 'editor', 'viewer')),
    granted_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    PRIMARY KEY (document_id, user_id)
);

The operation lifecycle flows from client submission to global sequence assignment. The Operation Service receives an operation with a client_id and client_seq. It acquires the document’s current sequence counter (stored in Redis for speed), transforms the operation if needed, increments the counter, and writes the operation to Postgres with the new global seq. The client receives an ACK with the global seq, allowing it to confirm the operation was committed and update its local state accordingly.

Key Algorithms and Protocols

OT Transform Function

The transform function is the mathematical guarantee that makes OT work. Every OT implementation must satisfy the TP1 property: given operations O1 and O2 on the same document state, transform(O1, O2) applied after O2 must yield the same result as transform(O2, O1) applied after O1. Violating this property causes divergence - clients see different documents.

# Server-side OT sequencer
# Transforms incoming ops against all concurrent ops with higher seq numbers

from collections import defaultdict

class OperationSequencer:
    def __init__(self, doc_store, redis_client):
        self.store = doc_store
        self.redis = redis_client

    def submit(self, document_id: str, op: Op, client_seq: int, client_id: str) -> int:
        """
        Transform op against all server ops the client hasn't seen yet,
        assign it a global sequence number, persist it, and return the seq.
        """
        lock_key = f"doc_lock:{document_id}"
        with self.redis.lock(lock_key, timeout=2):
            # Get current server seq for this document
            server_seq = int(self.redis.get(f"doc_seq:{document_id}") or 0)

            # Find all ops this client hasn't seen (submitted after client_seq)
            pending_ops = self.store.get_ops_after(document_id, client_seq)

            # Transform incoming op against each pending server op
            transformed = op
            for server_op in pending_ops:
                transformed = transform(transformed, server_op)
                if transformed is None:
                    # Op became a no-op (e.g., deleted already-deleted char)
                    break

            if transformed is not None:
                new_seq = server_seq + 1
                self.store.save_op(document_id, transformed, new_seq, client_id, client_seq)
                self.redis.set(f"doc_seq:{document_id}", new_seq)
                return new_seq

        return None  # no-op, nothing to broadcast

OT Insight

The lock in the sequencer is short - it only covers the transform and increment, not the broadcast or persistence roundtrip. Holding the lock for less than 2ms means thousands of operations per second per document before contention becomes an issue. For documents with extremely high write rates, sharding by paragraph region is the next scaling lever.

CRDT Merge Logic

CRDTs sidestep the transform function entirely by using identifiers that impose a total order on all characters. The merge is a set union - simply take all characters from both replicas and let the ToText linearization handle ordering. This makes CRDTs ideal for offline-first scenarios:

// CRDT offline merge - no server coordination required
func MergeOfflineEdits(local, remote *CRDTDoc) *CRDTDoc {
    merged := &CRDTDoc{
        Chars: make(map[CharID]*Char),
        Root:  local.Root,
    }
    // Copy local state
    for id, c := range local.Chars {
        cp := *c
        merged.Chars[id] = &cp
    }
    // Merge remote - tombstone deletion wins, new chars are added
    merged.Merge(remote)
    return merged
}

// Assign a globally unique, causally-ordered ID
func NewCharID(clientID string, lamportClock int64) CharID {
    return CharID{
        ClientID:  clientID,
        Timestamp: lamportClock,
        Seq:       int(atomic.AddInt64(&localSeq, 1)),
    }
}

CRDT Insight

The CRDT character ID encodes causal order (Lamport clock) and breaks ties with client ID. This ensures all clients linearize the document identically regardless of the order they receive operations. The trade-off is storage - tombstoned characters accumulate forever, requiring periodic compaction (garbage collection after all clients have acknowledged the deletion).

Cursor Position Transformation

Cursor positions are not just display data - they encode the user’s intent about where their next edit will land. After any remote operation is applied, all local cursors must be transformed through the same position-shift logic:

def transform_cursor(cursor_pos: int, op: Op) -> int:
    """
    Adjust cursor_pos after applying op to the document.
    Called for every peer cursor after each operation is applied.
    """
    if isinstance(op, Insert):
        # An insertion at or before our cursor pushes us right
        if op.pos <= cursor_pos:
            return cursor_pos + 1
    elif isinstance(op, Delete):
        if op.pos < cursor_pos:
            return cursor_pos - 1
        elif op.pos == cursor_pos:
            # The character at our cursor was deleted - stay in place
            return cursor_pos
    return cursor_pos

Scaling and Performance

Scaling a real-time collaboration system is fundamentally different from scaling a stateless API. The statefulness of document sessions means we can’t just add more instances and let load balancers distribute evenly - we must route all traffic for a given document to the same Operation Service instance. This is the same constraint databases face with leader-based replication.

The primary scaling strategy is horizontal partitioning by document_id. Each Operation Service instance owns a shard of the document space, determined by consistent hashing. When the cluster grows, documents are migrated between shards using a hot handoff: the new instance replays the operation log from the Document Store, then the load balancer atomically updates its routing table.

Horizontal scaling and document partitioning

# Capacity estimation

CONCURRENT_DOCS        = 5_000_000   # active doc sessions at peak
EDITORS_PER_DOC        = 3           # average concurrent editors
KEYSTROKES_PER_SEC     = 2           # per active editor
CURSOR_UPDATES_PER_SEC = 30          # per editor, throttled

# Operations per second
ops_per_sec = CONCURRENT_DOCS * EDITORS_PER_DOC * KEYSTROKES_PER_SEC
# = 30,000,000 ops/sec total

# Each Operation Service instance handles ~50,000 ops/sec
op_service_instances = ops_per_sec // 50_000
# = 600 instances

# Cursor updates dwarf operations
cursor_msgs_per_sec = CONCURRENT_DOCS * EDITORS_PER_DOC * CURSOR_UPDATES_PER_SEC
# = 450,000,000 msgs/sec - handled via Redis Pub/Sub fan-out

# WebSocket connections
ws_connections = CONCURRENT_DOCS * EDITORS_PER_DOC
# = 15,000,000 connections - ~5,000 per gateway instance = 3,000 gateway instances

# Storage (operations)
op_size_bytes        = 128
ops_per_day          = ops_per_sec * 86_400
storage_per_day_tb   = (ops_per_day * op_size_bytes) / 1e12
# = ~330 TB/day - requires aggressive compaction and TTL on old operations

How Google Does It

Google Docs originally used Jupiter, an OT-based server from the 1995 Xerox PARC research paper. Google Wave (2009) attempted a federated, open OT protocol and failed in the market but produced breakthrough engineering. Modern Google Docs uses a hybrid: OT for character-level operations with a centralized sequencer per document, and a CRDT-like approach for structural operations like table edits and formatting. The 2022 switch to WebAssembly-compiled collaborative editing library cut CPU overhead by 60% while maintaining the same convergence guarantees.

Failure Modes and Recovery

Failure	Detection	Impact	Recovery
Operation Service crash	Health probe timeout (2s)	In-flight ops not yet acked by clients are lost	Clients retransmit unacked ops; new instance replays doc from store and reprocesses
WebSocket Gateway crash	TCP RST or timeout	All connected clients disconnected	Clients reconnect with last-acked seq; gateway replays delta from in-memory buffer or Document Store
Document Store write failure	DB error on op persist	Operation appears committed to client but is not durable	Client retransmits on next heartbeat; idempotency key (client_id + client_seq) prevents duplicate application
Network partition (client offline)	Connection timeout	Client edits locally without server sync	On reconnect, client sends full CRDT state diff; server merges and broadcasts resolution to all peers
Redis Pub/Sub failure	Subscribe error	Cursor updates and presence stop broadcasting	Fall back to polling presence endpoint at 2s interval; no impact on document correctness
Sequence counter divergence	Checksum mismatch between replicas	Document divergence - two clients see different text	Circuit breaker triggers document reload: all clients fetch canonical state from Document Store and re-sync

Most Dangerous Failure Mode

Sequence counter divergence is the most catastrophic failure because it is silent - clients continue editing without knowing they are working on different versions of the document. By the time the divergence is detected (typically via a periodic checksum comparison), hundreds of operations may have been applied to diverged states. The recovery - forcing all clients to reload - causes visible disruption. Prevention requires strong consistency on the sequence counter with fencing tokens to detect stale writes.

Comparison of Approaches

Approach	Latency	Complexity	Failure Mode	Best Fit
Operational Transformation (OT)	Low (central sequencer)	High (transform function correctness is subtle)	Sequence divergence if sequencer has bugs	Documents with real-time low-latency requirement
CRDT	Low (no coordination)	Medium (ID scheme + compaction)	Tombstone accumulation, memory growth	Offline-first, eventually connected scenarios
Lock-based (pessimistic)	High (wait for lock)	Low (simple)	Deadlock, thundering herd on popular docs	Internal tools, low-concurrency scenarios
Last-write-wins	Very low	Very low	Data loss - concurrent edits overwrite each other	Append-only or single-user scenarios

For a system targeting millions of concurrent editors with sub-100ms latency and offline support, we would choose a hybrid: OT with a central sequencer for the primary editing path (because it produces the most predictable document structure and lower storage overhead than CRDTs), with a CRDT-based offline reconciliation layer for handling reconnection after network partitions longer than the operation buffer window. This is architecturally similar to what Notion uses - OT-style operations over a WebSocket with CRDT-compatible merge logic for offline edits.

The pure CRDT path (as used by Automerge, Yjs, and Apple Notes in iOS 16+) is compelling for mobile-first products where offline-first is a primary requirement. Its advantage is the elimination of the central sequencer, which removes a single point of failure and simplifies the server architecture at the cost of higher client-side memory and processing overhead.

Key Takeaways

Optimistic local application is non-negotiable - users must see their own keystrokes instantly, which requires applying operations locally before server confirmation and reconciling afterward
OT and CRDT solve the same problem differently - OT transforms operations in time (adjust positions based on concurrent ops), CRDTs transform the data structure itself (globally unique IDs make merge trivially correct)
Document-level statefulness is the key design constraint - all operations for a single document must flow through one coordinator to maintain sequence ordering; this drives the entire sharding strategy
Cursor broadcasting is a separate tier - cursor updates are ephemeral, high-frequency, and loss-tolerant; running them through the same path as document operations would create unnecessary backpressure
Snapshot plus delta is the right storage model - snapshotting every N operations bounds replay cost at document load time without sacrificing the complete operation history needed for version control
Offline reconciliation requires CRDT-compatible semantics - even if you use OT for online editing, offline edits must be represented in a way that can be merged without replaying the full history
The transform function is where correctness lives - every edge case in the OT transform function is a potential divergence bug; property-based testing against the TP1/TP2 invariants is the only reliable verification approach
Capacity estimation reveals the cursor problem - cursor updates outpace document operations by 15x at scale; they require a separate fan-out infrastructure (Redis Pub/Sub or a dedicated presence cluster) rather than going through the operation pipeline

Frequently Asked Questions

Q: Why not just use a distributed database with strong consistency for the document state?

Strong consistency (serializable transactions) would mean every keystroke blocks on a quorum write before the client sees it applied. At 100ms round-trip, typing would feel like responding to dial-up internet. The entire point of OT and CRDTs is to decouple the user experience from network latency by allowing optimistic local application and reconciling asynchronously.

Q: Why not use WebRTC peer-to-peer instead of a central server?

WebRTC P2P eliminates the server from the critical path, which sounds appealing. But it introduces three hard problems: you need a central server anyway for signaling and NAT traversal, P2P networks degrade badly when any peer has a poor connection (one slow peer blocks everyone), and you lose the central sequence number that makes OT transformation straightforward. Systems like ShareDB (used by CodeSandbox) use a server-authoritative approach precisely because it simplifies the convergence guarantee.

Q: How does Google Docs handle very large documents with thousands of operations?

At some threshold (typically 50,000+ operations since last snapshot), Google Docs creates a new document snapshot and starts a fresh operation log. Older operations are archived but not deleted - they are needed for the version history UI. For documents with hundreds of collaborators in a short period (like a shared meeting notes doc), the Operation Service may also rate-limit operations from any single client to prevent one fast typist from starving others.

Q: Why not use a last-write-wins CRDT (like Redis) instead of a sequence CRDT?

Last-write-wins (LWW) uses timestamps to resolve conflicts, which means concurrent edits at the same logical position are resolved by clock - the later timestamp wins and the earlier edit is discarded. For a document editor, discarding any user’s edit is unacceptable. Sequence CRDTs (like Logoot or LSEQ) assign unique IDs to every character and never discard anything - they only append to the character set, with deletions being soft-tombstones. This is the right tradeoff for collaborative text editing.

Q: How do you handle formatting operations (bold, italic) alongside character insertions?

Formatting operations are range-based (apply bold to characters 10-20) rather than position-based, which requires a different transform function. A character insertion inside the formatted range must expand the range; a deletion must shrink it. In practice, most systems represent formatting as a separate layer using a Peritext or Quill Delta-style model, where formatting spans are stored independently of character positions and transformed separately from insertion/deletion operations.

Q: What happens if two users simultaneously delete the same character?

In OT, the transform function handles this explicitly - the second delete is transformed against the first and becomes a no-op (the character is already gone). In CRDTs, tombstoning is idempotent - marking a character deleted twice has the same result as marking it once. Both approaches handle this gracefully; it is one of the easier cases compared to concurrent insertions at the same position.

Interview Questions

Q: Walk me through what happens when two users simultaneously type a character at position 5 in a document. How does the system ensure both characters appear and neither is lost?

Expected depth: Candidate should explain optimistic local application, the OT transform function (insert at position 5 gets transformed to position 6 for the second writer), the central sequencer’s role in establishing which operation arrived first, and how clients receive the transformed remote operation and apply it to their already-modified local state. Strong candidates mention tie-breaking by client_id for determinism when both ops arrive at the exact same server timestamp.

Q: A user edits offline for 20 minutes, making 500 character insertions. They reconnect. How do you merge their changes with the 10,000 operations other users submitted while they were offline?

Expected depth: Candidate should discuss the CRDT approach (unique character IDs make merge a set union), or the OT approach (replay the 500 offline ops against each of the 10,000 server ops via the transform function - O(nm) transforms). Strong candidates note the O(nm) cost is why long offline sessions are expensive and why some systems cap offline buffering at a time limit before forcing a hard sync.

Q: How would you design the presence system so that cursor positions from 50 simultaneous collaborators are visible in real time without overwhelming clients with cursor update messages?

Expected depth: Candidate should mention throttling cursor updates at the sender (30fps cap), using Redis Pub/Sub for fan-out to avoid the operation service bottleneck, and client-side interpolation to smooth cursor movement between updates. Advanced answers discuss prioritizing cursor updates from collaborators who are actively typing and dropping stale updates if a newer one is pending in the queue.

Q: The operation sequencer is a stateful single-instance component per document. How do you handle its failure without losing in-flight operations?

Expected depth: Candidate should explain that the document store is the source of truth, and the sequencer’s in-memory state can be rebuilt by replaying operations from the last snapshot. In-flight operations (submitted but not yet sequenced) are retransmitted by clients using their idempotency key (client_id + client_seq). The new sequencer instance picks up at the last persisted seq number. Strong candidates discuss fencing tokens to prevent the old instance from writing after failover.

Q: How would you scale the system to support 100 simultaneous editors on a single document (like a live event’s shared notes)?

Expected depth: Candidate should note that the document-level lock becomes a bottleneck at high concurrency - 100 simultaneous editors means 200+ operations per second all serialized through one lock. Solutions: partition the document into regions (paragraphs) with per-region locks (used by Figma for canvas objects), buffer and batch operations in short time windows (5ms batches), or switch to a pure CRDT where no central sequencer is needed. Strong candidates discuss the trade-off: regional locking is fast but requires transform functions across region boundaries.

Premium Content

Unlock the full article along with everything else in the archive — all in one place.

In-depth analysis Expert insights Full archive access

Unlock Full Article