Build Google Docs Real-Time Collaborative Editing
distributed-systems reliability scalability
System Design Deep Dive
Real-Time Collaborative Editing
The fundamental tension: every user must see their own keystrokes applied instantly, yet the document must converge to the same state across millions of concurrent editors.
Picture two colleagues writing on the same physical whiteboard from opposite ends of the room. They can both write simultaneously without bumping into each other, as long as they stay in separate areas. But the moment they both try to edit the same sentence - one crossing out a word while the other rewrites it - you need a referee, a way to reconcile two independent histories into one coherent result. Google Docs does this for millions of simultaneous writers across the internet, with a 90ms round-trip constraint and no referee standing by to adjudicate disputes.
The scale makes the challenge concrete: at peak, Google Docs serves tens of millions of concurrent document sessions. A popular shared document might have dozens of people editing simultaneously, each on different network connections with different latencies. Every keystroke a user types must feel instant on their own screen - that means applying it locally without waiting for server acknowledgment. Yet that same keystroke must somehow be reconciled with edits from every other user and result in a document that looks identical on all screens within a few hundred milliseconds.
Naive approaches collapse immediately under this constraint. Last-write-wins destroys work: if Alice types “Hello” and Bob simultaneously types “World” in the same position, one of them loses their edit entirely. Pessimistic locking prevents the problem but also prevents concurrency - only one user can type at a time, which is useless for a real-time collaboration tool. Simple merging strategies like operational queuing break when network partitions occur or when a user goes offline and makes edits that need to be reconciled later.
The forces in tension are: latency vs consistency (applying locally is fast but potentially wrong), offline support vs convergence (users need to edit without connectivity, but their changes must merge correctly later), and simplicity vs correctness (simpler algorithms break on edge cases that real users hit constantly). We need to solve three core problems: conflict-free concurrent editing (two edits to the same location must both survive and be applied in a way that preserves both users’ intent), cursor position stability (when Alice inserts text before Bob’s cursor, Bob’s cursor must jump forward automatically), and offline convergence (edits made without network access must reconcile correctly when connectivity returns).
Requirements and Constraints
Functional Requirements
- Multiple users can edit the same document simultaneously with their changes visible to all others in real time
- Each user’s cursor position and selection range is broadcast to all other active collaborators
- Users can edit offline and have their changes merged automatically when they reconnect
- Full document version history with per-operation attribution and the ability to restore any prior state
- Presence indicators showing which users are currently active in the document and their cursor positions
- Document is always in a readable, consistent state - no partial operations or half-applied edits visible to any user
Non-Functional Requirements
- Operation propagation latency under 100ms for 99th percentile from client submission to all connected peers seeing it
- 99.99% availability - editing must degrade gracefully, never block on a single server failure
- Support for millions of concurrently active documents, each potentially with dozens of simultaneous editors
- Strong eventual consistency - all clients that receive the same set of operations must converge to identical document state, regardless of operation arrival order
- Cursor updates propagated within 50ms to minimize cursor teleportation on peers’ screens
Constraints
- No pessimistic locking - users must never be blocked from typing by another user’s activity
- Operations must commute - applying operation A then B must yield the same result as B then A after transformation
- The convergence guarantee is the hard invariant - the system may sacrifice ordering guarantees but never convergence
High-Level Architecture
The system is composed of seven major components. The Client SDK runs in the browser, applying operations optimistically to the local document state and queuing them for transmission. The WebSocket Gateway manages persistent connections and routes messages to the correct document session. The Operation Service sequences incoming operations and applies the OT or CRDT transformation logic. The OT/CRDT Engine is the mathematical core that transforms concurrent operations so they can be safely applied. The Document Store persists the canonical document state, operation log, and version snapshots. The Presence Service tracks which users are active and their cursor positions. The Cursor Broadcast Service fans out cursor updates to all peers with minimal latency.
When a user types a character, the Client SDK immediately applies it to the local document (optimistic application) and emits an operation to the WebSocket Gateway over the persistent connection. The gateway routes the operation to the Operation Service responsible for that document’s session. The Operation Service assigns a sequence number, transforms the operation against any concurrent operations that arrived first, and persists it to the Document Store. The transformed operation is then broadcast to all other connected clients via the WebSocket Gateway. Each receiving client applies the operation to their own document state, using the same transformation logic to ensure convergence.
The operation service is stateful per document - a single service instance owns a document session for the duration it is active. This eliminates distributed coordination overhead on the critical path. Operations for the same document are processed sequentially by one instance, making transformation straightforward. The statefulness is tolerated because document sessions are ephemeral and can be rebuilt from the operation log on failover.
Component Deep Dives
The Operation Engine
The Operation Engine is the heart of the system - the component that makes concurrent edits safe. Think of it like a court reporter at a deposition where multiple witnesses are speaking at once: the reporter must transcribe every word accurately, even when two people speak simultaneously, and produce a single coherent transcript that faithfully represents what everyone said.
There are two dominant approaches to conflict-free collaborative editing: Operational Transformation (OT) and Conflict-free Replicated Data Types (CRDTs). Google Docs originally used OT, while Figma and Notion have moved toward CRDT-based approaches.
Operational Transformation works by transforming an operation’s parameters against other concurrent operations before applying it. If Alice inserts “X” at position 5 and Bob simultaneously inserts “Y” at position 5, the server transforms Bob’s operation to position 6 (after Alice’s insertion) before broadcasting it. The transform function is the critical primitive:
# Operational Transformation - core transform function
# Operations are: Insert(pos, char) or Delete(pos)
from dataclasses import dataclass
from typing import Union
@dataclass
class Insert:
pos: int
char: str
client_id: str
seq: int # client-local sequence number
@dataclass
class Delete:
pos: int
client_id: str
seq: int
Op = Union[Insert, Delete]
def transform(op: Op, against: Op) -> Op:
"""
Transform `op` as if `against` has already been applied.
Returns a new Op with adjusted position.
OT property: apply(apply(doc, against), transform(op, against))
== apply(apply(doc, op), transform(against, op))
"""
if isinstance(op, Insert) and isinstance(against, Insert):
if against.pos < op.pos:
# against inserts before op's position - shift right
return Insert(op.pos + 1, op.char, op.client_id, op.seq)
elif against.pos == op.pos:
# tie-break by client_id to ensure deterministic ordering
if against.client_id < op.client_id:
return Insert(op.pos + 1, op.char, op.client_id, op.seq)
return op
if isinstance(op, Insert) and isinstance(against, Delete):
if against.pos < op.pos:
# a deletion before us shifts our position left
return Insert(op.pos - 1, op.char, op.client_id, op.seq)
return op
if isinstance(op, Delete) and isinstance(against, Insert):
if against.pos <= op.pos:
# an insertion at or before us shifts our position right
return Delete(op.pos + 1, op.client_id, op.seq)
return op
if isinstance(op, Delete) and isinstance(against, Delete):
if against.pos < op.pos:
return Delete(op.pos - 1, op.client_id, op.seq)
elif against.pos == op.pos:
# both deleted the same character - op becomes a no-op
return None # idempotent delete
return op
return op
def apply(doc: list, op: Op) -> list:
"""Apply an operation to a document represented as a character list."""
if op is None:
return doc
doc = doc[:]
if isinstance(op, Insert):
doc.insert(op.pos, op.char)
elif isinstance(op, Delete):
if op.pos < len(doc):
doc.pop(op.pos)
return doc
CRDT-based approaches take a different path: instead of transforming operations, they design the data structure so that concurrent operations are always commutative and idempotent by construction. A common approach is a Sequence CRDT where every character has a globally unique ID and tombstones are used for deletions. No coordination is required because the merge is defined mathematically:
// CRDT Character - each char has a unique globally-ordered ID
type CharID struct {
ClientID string
Timestamp int64
Seq int
}
type Char struct {
ID CharID
Value rune
Deleted bool
After CharID // ID of the character this one follows
}
type CRDTDoc struct {
Chars map[CharID]*Char
Root CharID // sentinel head
}
// Merge two CRDT states - commutative and idempotent
func (d *CRDTDoc) Merge(remote *CRDTDoc) {
for id, remoteChar := range remote.Chars {
if local, exists := d.Chars[id]; exists {
// Tombstone wins - deletion is permanent
if remoteChar.Deleted {
local.Deleted = true
}
} else {
// New character from remote - add it
c := *remoteChar
d.Chars[id] = &c
}
}
}
// Linearize the CRDT into a readable string
func (d *CRDTDoc) ToText() string {
// Build adjacency: after -> []chars sorted by ID
children := map[CharID][]*Char{}
for _, c := range d.Chars {
children[c.After] = append(children[c.After], c)
}
// Sort children by (Timestamp, ClientID) for determinism
var result []rune
var walk func(id CharID)
walk = func(id CharID) {
cs := children[id]
sortChars(cs) // sort by CharID for deterministic order
for _, c := range cs {
if !c.Deleted {
result = append(result, c.Value)
}
walk(c.ID)
}
}
walk(d.Root)
return string(result)
}
The WebSocket Gateway
The WebSocket Gateway is the nervous system of the collaboration architecture. It manages the persistent long-lived connections between clients and the server - think of it as the switchboard in a large telephone exchange, where thousands of callers need to be connected to the right conference room instantly.
Each WebSocket Gateway instance manages thousands of concurrent connections. The gateway uses document_id-based routing - all connections for the same document are affinity-routed to the same Operation Service instance. This is achieved with consistent hashing on document_id at the load balancer level. When a client sends an operation, the gateway parses the document_id from the message header and routes it to the correct downstream instance with minimal overhead.
Connection management handles the messy reality of mobile and unreliable networks. Clients implement exponential backoff reconnection with a maximum of 30 seconds between attempts. On reconnect, the client sends its last acknowledged operation sequence number, and the server replays any missed operations as a compressed delta. The gateway maintains a short in-memory buffer (30 seconds of operations) per document to serve fast replays without hitting the Document Store for minor network blips.
The Document Store
The Document Store has a dual responsibility: it must serve fast reads for clients loading a document, and it must provide an immutable, replayable operation log for version history and conflict resolution. Think of it like a bank ledger - the current balance is computed from the transaction history, but you also keep snapshots (monthly statements) to avoid replaying ten years of transactions every time someone checks their balance.
We use a snapshot-plus-delta storage model. The canonical document state is snapshotted every 500 operations or every 10 minutes, whichever comes first. Between snapshots, individual operations are stored as deltas in an append-only operations table. Loading a document means fetching the latest snapshot and replaying the operations that followed it - a bounded, fast operation.
-- Document storage schema
CREATE TABLE documents (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
title TEXT NOT NULL DEFAULT 'Untitled Document',
owner_id UUID NOT NULL REFERENCES users(id),
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
is_deleted BOOLEAN NOT NULL DEFAULT FALSE
);
-- Immutable operation log - append-only, never updated
CREATE TABLE operations (
id BIGSERIAL,
document_id UUID NOT NULL REFERENCES documents(id),
seq BIGINT NOT NULL, -- global sequence within document
client_seq INT NOT NULL, -- client-local sequence (for ACK)
client_id UUID NOT NULL, -- which client submitted this
op_type TEXT NOT NULL, -- 'insert' | 'delete' | 'retain'
op_data JSONB NOT NULL, -- {pos, char} or {pos, len}
applied_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
PRIMARY KEY (document_id, seq)
) PARTITION BY HASH (document_id);
-- Periodic snapshots to bound replay cost
CREATE TABLE document_versions (
id BIGSERIAL PRIMARY KEY,
document_id UUID NOT NULL REFERENCES documents(id),
seq_at BIGINT NOT NULL, -- snapshot taken after this seq
content TEXT NOT NULL, -- full document text at this seq
char_count INT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
UNIQUE (document_id, seq_at)
);
CREATE INDEX idx_ops_doc_seq ON operations (document_id, seq);
CREATE INDEX idx_versions_doc ON document_versions (document_id, seq_at DESC);
The operations table is hash-partitioned by document_id. This distributes write load across shards while keeping all operations for a single document on the same partition - critical for efficient replay. We use a composite primary key of (document_id, seq) so that sequence-ordered scans are physically local.
The Presence and Cursor Service
The Presence and Cursor Service solves a different problem than the Operation Engine: it must broadcast ephemeral, high-frequency positional data to all collaborators with minimal latency. Unlike document operations - which must be durable and ordered - cursor positions are stateless and can be dropped. If a cursor update is lost, the next update (arriving 50ms later) will supersede it.
Cursor positions are maintained in Redis using per-document hash maps keyed by client_id. Each client sends a cursor update on every selection change, throttled to 30 updates per second. The Cursor Broadcast Service subscribes to Redis Pub/Sub channels (one per document) and fans out updates to all connected WebSocket clients. Because cursor data is ephemeral, it never touches the Document Store.
When a document operation is applied, the Presence Service also transforms all active cursor positions through the same OT transform logic. If Alice inserts a character at position 10, every cursor at position 10 or higher must be shifted right. This transformation happens synchronously with operation application so that broadcasted cursor positions are always valid relative to the current document state.
Data Model
The data model maps directly to the three core concerns: documents (the thing being edited), operations (the immutable record of every edit), and versions (the periodic snapshots for fast loading).
-- Partitioned operations table for scale
-- Each shard holds ~10-20% of all documents
CREATE TABLE operations_p0 PARTITION OF operations
FOR VALUES WITH (modulus 8, remainder 0);
CREATE TABLE operations_p1 PARTITION OF operations
FOR VALUES WITH (modulus 8, remainder 1);
-- ... p2 through p7
-- Presence tracking (Redis-backed, but mirrored here for audit)
CREATE TABLE presence (
document_id UUID NOT NULL REFERENCES documents(id),
client_id UUID NOT NULL,
user_id UUID REFERENCES users(id),
cursor_pos INT,
selection_start INT,
selection_end INT,
color TEXT, -- assigned color for cursor display
last_seen TIMESTAMPTZ NOT NULL DEFAULT NOW(),
PRIMARY KEY (document_id, client_id)
);
-- Document sharing and permissions
CREATE TABLE document_access (
document_id UUID NOT NULL REFERENCES documents(id),
user_id UUID NOT NULL REFERENCES users(id),
role TEXT NOT NULL CHECK (role IN ('owner', 'editor', 'viewer')),
granted_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
PRIMARY KEY (document_id, user_id)
);
The operation lifecycle flows from client submission to global sequence assignment. The Operation Service receives an operation with a client_id and client_seq. It acquires the document’s current sequence counter (stored in Redis for speed), transforms the operation if needed, increments the counter, and writes the operation to Postgres with the new global seq. The client receives an ACK with the global seq, allowing it to confirm the operation was committed and update its local state accordingly.
Key Algorithms and Protocols
OT Transform Function
The transform function is the mathematical guarantee that makes OT work. Every OT implementation must satisfy the TP1 property: given operations O1 and O2 on the same document state, transform(O1, O2) applied after O2 must yield the same result as transform(O2, O1) applied after O1. Violating this property causes divergence - clients see different documents.
# Server-side OT sequencer
# Transforms incoming ops against all concurrent ops with higher seq numbers
from collections import defaultdict
class OperationSequencer:
def __init__(self, doc_store, redis_client):
self.store = doc_store
self.redis = redis_client
def submit(self, document_id: str, op: Op, client_seq: int, client_id: str) -> int:
"""
Transform op against all server ops the client hasn't seen yet,
assign it a global sequence number, persist it, and return the seq.
"""
lock_key = f"doc_lock:{document_id}"
with self.redis.lock(lock_key, timeout=2):
# Get current server seq for this document
server_seq = int(self.redis.get(f"doc_seq:{document_id}") or 0)
# Find all ops this client hasn't seen (submitted after client_seq)
pending_ops = self.store.get_ops_after(document_id, client_seq)
# Transform incoming op against each pending server op
transformed = op
for server_op in pending_ops:
transformed = transform(transformed, server_op)
if transformed is None:
# Op became a no-op (e.g., deleted already-deleted char)
break
if transformed is not None:
new_seq = server_seq + 1
self.store.save_op(document_id, transformed, new_seq, client_id, client_seq)
self.redis.set(f"doc_seq:{document_id}", new_seq)
return new_seq
return None # no-op, nothing to broadcast
The lock in the sequencer is short - it only covers the transform and increment, not the broadcast or persistence roundtrip. Holding the lock for less than 2ms means thousands of operations per second per document before contention becomes an issue. For documents with extremely high write rates, sharding by paragraph region is the next scaling lever.
CRDT Merge Logic
CRDTs sidestep the transform function entirely by using identifiers that impose a total order on all characters. The merge is a set union - simply take all characters from both replicas and let the ToText linearization handle ordering. This makes CRDTs ideal for offline-first scenarios:
// CRDT offline merge - no server coordination required
func MergeOfflineEdits(local, remote *CRDTDoc) *CRDTDoc {
merged := &CRDTDoc{
Chars: make(map[CharID]*Char),
Root: local.Root,
}
// Copy local state
for id, c := range local.Chars {
cp := *c
merged.Chars[id] = &cp
}
// Merge remote - tombstone deletion wins, new chars are added
merged.Merge(remote)
return merged
}
// Assign a globally unique, causally-ordered ID
func NewCharID(clientID string, lamportClock int64) CharID {
return CharID{
ClientID: clientID,
Timestamp: lamportClock,
Seq: int(atomic.AddInt64(&localSeq, 1)),
}
}
The CRDT character ID encodes causal order (Lamport clock) and breaks ties with client ID. This ensures all clients linearize the document identically regardless of the order they receive operations. The trade-off is storage - tombstoned characters accumulate forever, requiring periodic compaction (garbage collection after all clients have acknowledged the deletion).
Cursor Position Transformation
Cursor positions are not just display data - they encode the user’s intent about where their next edit will land. After any remote operation is applied, all local cursors must be transformed through the same position-shift logic:
def transform_cursor(cursor_pos: int, op: Op) -> int:
"""
Adjust cursor_pos after applying op to the document.
Called for every peer cursor after each operation is applied.
"""
if isinstance(op, Insert):
# An insertion at or before our cursor pushes us right
if op.pos <= cursor_pos:
return cursor_pos + 1
elif isinstance(op, Delete):
if op.pos < cursor_pos:
return cursor_pos - 1
elif op.pos == cursor_pos:
# The character at our cursor was deleted - stay in place
return cursor_pos
return cursor_pos
Scaling and Performance
Scaling a real-time collaboration system is fundamentally different from scaling a stateless API. The statefulness of document sessions means we can’t just add more instances and let load balancers distribute evenly - we must route all traffic for a given document to the same Operation Service instance. This is the same constraint databases face with leader-based replication.
The primary scaling strategy is horizontal partitioning by document_id. Each Operation Service instance owns a shard of the document space, determined by consistent hashing. When the cluster grows, documents are migrated between shards using a hot handoff: the new instance replays the operation log from the Document Store, then the load balancer atomically updates its routing table.
# Capacity estimation
CONCURRENT_DOCS = 5_000_000 # active doc sessions at peak
EDITORS_PER_DOC = 3 # average concurrent editors
KEYSTROKES_PER_SEC = 2 # per active editor
CURSOR_UPDATES_PER_SEC = 30 # per editor, throttled
# Operations per second
ops_per_sec = CONCURRENT_DOCS * EDITORS_PER_DOC * KEYSTROKES_PER_SEC
# = 30,000,000 ops/sec total
# Each Operation Service instance handles ~50,000 ops/sec
op_service_instances = ops_per_sec // 50_000
# = 600 instances
# Cursor updates dwarf operations
cursor_msgs_per_sec = CONCURRENT_DOCS * EDITORS_PER_DOC * CURSOR_UPDATES_PER_SEC
# = 450,000,000 msgs/sec - handled via Redis Pub/Sub fan-out
# WebSocket connections
ws_connections = CONCURRENT_DOCS * EDITORS_PER_DOC
# = 15,000,000 connections - ~5,000 per gateway instance = 3,000 gateway instances
# Storage (operations)
op_size_bytes = 128
ops_per_day = ops_per_sec * 86_400
storage_per_day_tb = (ops_per_day * op_size_bytes) / 1e12
# = ~330 TB/day - requires aggressive compaction and TTL on old operations
Google Docs originally used Jupiter, an OT-based server from the 1995 Xerox PARC research paper. Google Wave (2009) attempted a federated, open OT protocol and failed in the market but produced breakthrough engineering. Modern Google Docs uses a hybrid: OT for character-level operations with a centralized sequencer per document, and a CRDT-like approach for structural operations like table edits and formatting. The 2022 switch to WebAssembly-compiled collaborative editing library cut CPU overhead by 60% while maintaining the same convergence guarantees.
Failure Modes and Recovery
| Failure | Detection | Impact | Recovery |
|---|---|---|---|
| Operation Service crash | Health probe timeout (2s) | In-flight ops not yet acked by clients are lost | Clients retransmit unacked ops; new instance replays doc from store and reprocesses |
| WebSocket Gateway crash | TCP RST or timeout | All connected clients disconnected | Clients reconnect with last-acked seq; gateway replays delta from in-memory buffer or Document Store |
| Document Store write failure | DB error on op persist | Operation appears committed to client but is not durable | Client retransmits on next heartbeat; idempotency key (client_id + client_seq) prevents duplicate application |
| Network partition (client offline) | Connection timeout | Client edits locally without server sync | On reconnect, client sends full CRDT state diff; server merges and broadcasts resolution to all peers |
| Redis Pub/Sub failure | Subscribe error | Cursor updates and presence stop broadcasting | Fall back to polling presence endpoint at 2s interval; no impact on document correctness |
| Sequence counter divergence | Checksum mismatch between replicas | Document divergence - two clients see different text | Circuit breaker triggers document reload: all clients fetch canonical state from Document Store and re-sync |
Sequence counter divergence is the most catastrophic failure because it is silent - clients continue editing without knowing they are working on different versions of the document. By the time the divergence is detected (typically via a periodic checksum comparison), hundreds of operations may have been applied to diverged states. The recovery - forcing all clients to reload - causes visible disruption. Prevention requires strong consistency on the sequence counter with fencing tokens to detect stale writes.
Comparison of Approaches
| Approach | Latency | Complexity | Failure Mode | Best Fit |
|---|---|---|---|---|
| Operational Transformation (OT) | Low (central sequencer) | High (transform function correctness is subtle) | Sequence divergence if sequencer has bugs | Documents with real-time low-latency requirement |
| CRDT | Low (no coordination) | Medium (ID scheme + compaction) | Tombstone accumulation, memory growth | Offline-first, eventually connected scenarios |
| Lock-based (pessimistic) | High (wait for lock) | Low (simple) | Deadlock, thundering herd on popular docs | Internal tools, low-concurrency scenarios |
| Last-write-wins | Very low | Very low | Data loss - concurrent edits overwrite each other | Append-only or single-user scenarios |
For a system targeting millions of concurrent editors with sub-100ms latency and offline support, we would choose a hybrid: OT with a central sequencer for the primary editing path (because it produces the most predictable document structure and lower storage overhead than CRDTs), with a CRDT-based offline reconciliation layer for handling reconnection after network partitions longer than the operation buffer window. This is architecturally similar to what Notion uses - OT-style operations over a WebSocket with CRDT-compatible merge logic for offline edits.
The pure CRDT path (as used by Automerge, Yjs, and Apple Notes in iOS 16+) is compelling for mobile-first products where offline-first is a primary requirement. Its advantage is the elimination of the central sequencer, which removes a single point of failure and simplifies the server architecture at the cost of higher client-side memory and processing overhead.
Key Takeaways
- Optimistic local application is non-negotiable - users must see their own keystrokes instantly, which requires applying operations locally before server confirmation and reconciling afterward
- OT and CRDT solve the same problem differently - OT transforms operations in time (adjust positions based on concurrent ops), CRDTs transform the data structure itself (globally unique IDs make merge trivially correct)
- Document-level statefulness is the key design constraint - all operations for a single document must flow through one coordinator to maintain sequence ordering; this drives the entire sharding strategy
- Cursor broadcasting is a separate tier - cursor updates are ephemeral, high-frequency, and loss-tolerant; running them through the same path as document operations would create unnecessary backpressure
- Snapshot plus delta is the right storage model - snapshotting every N operations bounds replay cost at document load time without sacrificing the complete operation history needed for version control
- Offline reconciliation requires CRDT-compatible semantics - even if you use OT for online editing, offline edits must be represented in a way that can be merged without replaying the full history
- The transform function is where correctness lives - every edge case in the OT transform function is a potential divergence bug; property-based testing against the TP1/TP2 invariants is the only reliable verification approach
- Capacity estimation reveals the cursor problem - cursor updates outpace document operations by 15x at scale; they require a separate fan-out infrastructure (Redis Pub/Sub or a dedicated presence cluster) rather than going through the operation pipeline
Frequently Asked Questions
Q: Why not just use a distributed database with strong consistency for the document state?
Strong consistency (serializable transactions) would mean every keystroke blocks on a quorum write before the client sees it applied. At 100ms round-trip, typing would feel like responding to dial-up internet. The entire point of OT and CRDTs is to decouple the user experience from network latency by allowing optimistic local application and reconciling asynchronously.
Q: Why not use WebRTC peer-to-peer instead of a central server?
WebRTC P2P eliminates the server from the critical path, which sounds appealing. But it introduces three hard problems: you need a central server anyway for signaling and NAT traversal, P2P networks degrade badly when any peer has a poor connection (one slow peer blocks everyone), and you lose the central sequence number that makes OT transformation straightforward. Systems like ShareDB (used by CodeSandbox) use a server-authoritative approach precisely because it simplifies the convergence guarantee.
Q: How does Google Docs handle very large documents with thousands of operations?
At some threshold (typically 50,000+ operations since last snapshot), Google Docs creates a new document snapshot and starts a fresh operation log. Older operations are archived but not deleted - they are needed for the version history UI. For documents with hundreds of collaborators in a short period (like a shared meeting notes doc), the Operation Service may also rate-limit operations from any single client to prevent one fast typist from starving others.
Q: Why not use a last-write-wins CRDT (like Redis) instead of a sequence CRDT?
Last-write-wins (LWW) uses timestamps to resolve conflicts, which means concurrent edits at the same logical position are resolved by clock - the later timestamp wins and the earlier edit is discarded. For a document editor, discarding any user’s edit is unacceptable. Sequence CRDTs (like Logoot or LSEQ) assign unique IDs to every character and never discard anything - they only append to the character set, with deletions being soft-tombstones. This is the right tradeoff for collaborative text editing.
Q: How do you handle formatting operations (bold, italic) alongside character insertions?
Formatting operations are range-based (apply bold to characters 10-20) rather than position-based, which requires a different transform function. A character insertion inside the formatted range must expand the range; a deletion must shrink it. In practice, most systems represent formatting as a separate layer using a Peritext or Quill Delta-style model, where formatting spans are stored independently of character positions and transformed separately from insertion/deletion operations.
Q: What happens if two users simultaneously delete the same character?
In OT, the transform function handles this explicitly - the second delete is transformed against the first and becomes a no-op (the character is already gone). In CRDTs, tombstoning is idempotent - marking a character deleted twice has the same result as marking it once. Both approaches handle this gracefully; it is one of the easier cases compared to concurrent insertions at the same position.
Interview Questions
Q: Walk me through what happens when two users simultaneously type a character at position 5 in a document. How does the system ensure both characters appear and neither is lost?
Expected depth: Candidate should explain optimistic local application, the OT transform function (insert at position 5 gets transformed to position 6 for the second writer), the central sequencer’s role in establishing which operation arrived first, and how clients receive the transformed remote operation and apply it to their already-modified local state. Strong candidates mention tie-breaking by client_id for determinism when both ops arrive at the exact same server timestamp.
Q: A user edits offline for 20 minutes, making 500 character insertions. They reconnect. How do you merge their changes with the 10,000 operations other users submitted while they were offline?
Expected depth: Candidate should discuss the CRDT approach (unique character IDs make merge a set union), or the OT approach (replay the 500 offline ops against each of the 10,000 server ops via the transform function - O(nm) transforms). Strong candidates note the O(nm) cost is why long offline sessions are expensive and why some systems cap offline buffering at a time limit before forcing a hard sync.
Q: How would you design the presence system so that cursor positions from 50 simultaneous collaborators are visible in real time without overwhelming clients with cursor update messages?
Expected depth: Candidate should mention throttling cursor updates at the sender (30fps cap), using Redis Pub/Sub for fan-out to avoid the operation service bottleneck, and client-side interpolation to smooth cursor movement between updates. Advanced answers discuss prioritizing cursor updates from collaborators who are actively typing and dropping stale updates if a newer one is pending in the queue.
Q: The operation sequencer is a stateful single-instance component per document. How do you handle its failure without losing in-flight operations?
Expected depth: Candidate should explain that the document store is the source of truth, and the sequencer’s in-memory state can be rebuilt by replaying operations from the last snapshot. In-flight operations (submitted but not yet sequenced) are retransmitted by clients using their idempotency key (client_id + client_seq). The new sequencer instance picks up at the last persisted seq number. Strong candidates discuss fencing tokens to prevent the old instance from writing after failover.
Q: How would you scale the system to support 100 simultaneous editors on a single document (like a live event’s shared notes)?
Expected depth: Candidate should note that the document-level lock becomes a bottleneck at high concurrency - 100 simultaneous editors means 200+ operations per second all serialized through one lock. Solutions: partition the document into regions (paragraphs) with per-region locks (used by Figma for canvas objects), buffer and batch operations in short time windows (5ms batches), or switch to a pure CRDT where no central sequencer is needed. Strong candidates discuss the trade-off: regional locking is fast but requires transform functions across region boundaries.
Premium Content
Unlock the full article along with everything else in the archive — all in one place.