Build a Distributed Shopping Cart with Guest Session Merge
distributed-systems scalability databases
System Design Deep Dive
Distributed Shopping Cart with Session Merge
Merging guest and authenticated cart state without losing items or creating duplicates at 50K RPS
Think of a shopping cart like a physical basket at a supermarket. When you walk in without a loyalty card, you still get a basket and fill it with items. The moment you swipe your card at checkout, the cashier needs to reconcile your basket with any pending online order on your account. That reconciliation is instantaneous, automatic, and must never drop your avocados. Now imagine 50,000 shoppers doing this simultaneously, from 15 different countries, on devices ranging from a 2017 Android to a MacBook Pro - and the “cashier” is a distributed cluster with no single node that knows the full picture.
A naive implementation stores the cart in the server session: session["cart"] = [...]. This works at 100 requests per second on a single box. At 50,000 RPS across 20 application servers, you have no session affinity guarantee. Every request may hit a different server, each carrying a stale or empty cart. Sticky sessions feel like a solution - pin each user to one server. But a server restart or autoscale event wipes your cart silently. Amazon discovered this the hard way in the early 2000s: session-affinity-based carts lost millions of dollars in abandoned items every week during holiday traffic spikes.
The guest-to-authenticated merge problem adds another layer of complexity. A guest user (identified by a browser fingerprint or anonymous session_id) builds a cart over multiple days - 7 items, 3 sessions, 2 devices. When they finally sign in, we must merge that guest cart into their authenticated cart without losing items, without creating phantom duplicates if the same product appears in both carts, and without blocking the login response for more than 50ms. Worse, what if they are logged in on their phone and a guest on their laptop simultaneously, and they add to both before merging?
Cart state is fundamentally distributed state with convergence requirements. It is not a simple key-value lookup. We need to solve for durable guest cart persistence with TTL, conflict-free merge on login, concurrent modification safety, and sub-50ms read latency simultaneously. Each constraint points in a different direction, and the interesting engineering lives at those intersections.
Requirements and Constraints
Functional Requirements
- Guest users can add, remove, and update item quantities in their cart without logging in
- Guest cart persists across browser sessions using a fingerprint or anonymous token stored in a cookie
- On login, guest cart merges into the authenticated user’s cart automatically
- Merge strategy handles overlapping items (same SKU in both carts) without duplicates
- Cart survives application server restarts, deployments, and node failures
- Cart supports concurrent modifications from multiple devices (same user logged in on phone and laptop)
- Each cart operation (add/remove/update) is idempotent - retries do not double-count
- Cart items expire after 30 days for guest users (TTL-based cleanup)
Non-Functional Requirements
- Read latency: p99 under 20ms for cart fetch
- Write latency: p99 under 50ms for item add/update
- Availability: 99.99% (4 nines) - cart loss directly causes revenue loss
- Throughput: 50,000 cart operations per second at peak (Black Friday scale)
- Scale: 200 million active guest carts, 80 million authenticated carts
- Merge operation: complete within 200ms of login event
- Storage: average cart holds 8 items at 500 bytes each = 4KB per cart; 280M carts = 1.1TB raw cart data
- Concurrency: support 5 simultaneous device sessions per user with last-write-wins or union semantics
Constraints
- We are not designing the payment or checkout system - cart is pre-checkout state only
- Product catalog and pricing are owned by a separate service; cart stores SKU and quantity only
- We assume cookie support is available; we do not design for cookie-blocked environments
- Single datacenter design initially; multi-region replication is out of scope but architected for
High-Level Architecture
The cart system decomposes into six major components that each handle a distinct concern.
The Cart API Gateway is the single entry point for all cart operations. It authenticates requests, extracts the identity (guest session_id or authenticated user_id), and routes to the Cart Service. It also handles idempotency key injection for retryable operations.
The Cart Service is the core application logic layer. It owns the cart read/write operations, manages the merge workflow, and orchestrates calls to the storage tier. This is a stateless service that scales horizontally.
The Guest Cart Store is a Redis cluster holding all anonymous carts, keyed by session_id. Every entry carries a 30-day TTL. Redis is the right choice here because guest carts are ephemeral, access patterns are simple key lookups, and TTL management is native.
The Authenticated Cart Store is a Postgres cluster holding carts for logged-in users. Postgres gives us ACID transactions for the merge operation, which must be atomic: read both carts, compute union, write merged result, delete guest cart.
The Cart Event Log is an append-only Kafka topic that captures every cart mutation as an event. This is the source of truth for analytics, and it enables cart event sourcing: any cart can be reconstructed from its event history if the primary store is corrupted.
The Session Merge Service consumes login events from the auth system and triggers the merge workflow asynchronously, allowing the login API to return immediately while merge proceeds in the background within the 200ms SLA.
Separating guest carts (Redis, volatile, TTL-managed) from authenticated carts (Postgres, durable, transactional) lets you optimize each store for its actual access pattern instead of forcing both onto the same system with conflicting requirements.
Guest Cart Store Design
The guest cart store handles 70-80% of all cart traffic because most e-commerce visitors never log in during a session. Its job is simple: hold mutable cart state for anonymous users, expire it automatically, and do so with single-digit millisecond reads.
Redis is the right storage layer here. Think of Redis as a whiteboard in a break room: everyone can read and write it instantly, but if you unplug it, everything is gone. The persistence requirements for guest carts are relaxed - a 30-day TTL means we can tolerate losing a few minutes of data on a Redis restart if we configure appendonly yes with fsync everysec. We accept occasional cart loss for the same reason Snapchat accepts message loss: the cost of stronger durability (synchronous fsync, full AOF) exceeds the value of the data.
Each guest cart is stored as a Redis Hash, keyed by guest:{session_id}:cart. We use a Hash rather than a JSON blob so we can atomically increment a single item’s quantity without deserializing and reserializing the whole cart. The field name is the SKU; the value is a JSON-encoded item struct carrying quantity, added timestamp, and price snapshot.
# Guest cart structure in Redis
# Key: guest:{session_id}:cart
# Field: {sku_id}
# Value: JSON-encoded CartItem
HSET guest:abc123xyz:cart SKU-9001 '{"sku":"SKU-9001","qty":2,"price_cents":4999,"added_at":"2026-06-09T10:00:00Z"}'
HSET guest:abc123xyz:cart SKU-7002 '{"sku":"SKU-7002","qty":1,"price_cents":1299,"added_at":"2026-06-09T10:05:00Z"}'
# Set/refresh TTL to 30 days on every write
EXPIRE guest:abc123xyz:cart 2592000
# Atomic quantity increment (avoids read-modify-write race)
# We update the entire field JSON - use a Lua script for atomicity
EVAL "
local key = KEYS[1]
local sku = ARGV[1]
local delta = tonumber(ARGV[2])
local item = redis.call('HGET', key, sku)
if item then
local parsed = cjson.decode(item)
parsed['qty'] = parsed['qty'] + delta
if parsed['qty'] <= 0 then
redis.call('HDEL', key, sku)
else
redis.call('HSET', key, sku, cjson.encode(parsed))
end
redis.call('EXPIRE', key, 2592000)
return parsed['qty']
else
return 0
end
" 1 guest:abc123xyz:cart SKU-9001 1
# Fetch entire cart
HGETALL guest:abc123xyz:cart
# Delete guest cart after successful merge
DEL guest:abc123xyz:cart
The Lua script for quantity increment is critical. Without it, a concurrent add and remove on the same SKU from two browser tabs would race: both read quantity=2, one writes 3, one writes 1, and the final value depends on timing. The Lua script executes atomically on the Redis server - no other command runs between the HGET and HSET.
Storing cart data as a single JSON blob in Redis (GET/SET pattern) instead of a Hash forces every item add to deserialize and reserialize the entire cart. At 50K RPS with carts averaging 4KB, this is 200MB/s of unnecessary JSON parsing on the application servers. Use HSET/HGET and keep items as individual Hash fields.
Authenticated Cart Store Design
The authenticated cart store holds carts for logged-in users. Unlike guest carts, these persist indefinitely (until checkout or manual clear), require ACID transactions for the merge operation, and must handle concurrent writes from multiple devices.
Postgres is the right choice here because the merge operation is a multi-row transaction: read guest cart, read user cart, compute union, write merged cart, and delete guest cart - all atomically. Redis Transactions (MULTI/EXEC) are optimistic and can conflict; Postgres row-level locking gives us the serialization guarantees we need.
We use an event sourcing model for the authenticated cart. Rather than storing the current cart state as a mutable row, we append every cart event (ITEM_ADDED, ITEM_REMOVED, QUANTITY_UPDATED, CART_MERGED) to a cart_events table, and materialize the current state into a cart_snapshots table for fast reads. This gives us full audit history, the ability to replay carts if a snapshot is corrupted, and a natural fit for the Kafka event log.
-- Authenticated cart event sourcing schema
-- Event log: immutable append-only
CREATE TABLE cart_events (
id BIGSERIAL PRIMARY KEY,
user_id UUID NOT NULL,
event_type VARCHAR(30) NOT NULL
CHECK (event_type IN (
'ITEM_ADDED', 'ITEM_REMOVED', 'QUANTITY_UPDATED',
'CART_MERGED', 'CART_CLEARED', 'CART_CHECKED_OUT'
)),
sku_id VARCHAR(64),
quantity_delta INT, -- positive for add, negative for remove
price_cents INT, -- snapshot of price at event time
idempotency_key VARCHAR(128) UNIQUE, -- prevents duplicate events on retry
metadata JSONB, -- merge_source_session_id, device_id, etc.
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_cart_events_user_id ON cart_events (user_id, created_at DESC);
CREATE INDEX idx_cart_events_idempotency ON cart_events (idempotency_key)
WHERE idempotency_key IS NOT NULL;
-- Materialized snapshot for fast cart reads
CREATE TABLE cart_snapshots (
user_id UUID PRIMARY KEY,
items JSONB NOT NULL DEFAULT '{}', -- {sku_id: {qty, price_cents, added_at}}
last_event_id BIGINT NOT NULL,
item_count INT NOT NULL DEFAULT 0,
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_cart_snapshots_updated ON cart_snapshots (updated_at)
WHERE item_count > 0;
-- Guest cart reference table for merge tracking
CREATE TABLE guest_cart_merges (
id BIGSERIAL PRIMARY KEY,
user_id UUID NOT NULL,
session_id VARCHAR(128) NOT NULL,
merge_status VARCHAR(20) NOT NULL DEFAULT 'PENDING'
CHECK (merge_status IN ('PENDING', 'COMPLETED', 'FAILED', 'SKIPPED')),
items_merged INT DEFAULT 0,
conflict_count INT DEFAULT 0,
started_at TIMESTAMPTZ DEFAULT NOW(),
completed_at TIMESTAMPTZ,
error_message TEXT
);
CREATE INDEX idx_guest_merges_user ON guest_cart_merges (user_id, merge_status);
CREATE INDEX idx_guest_merges_session ON guest_cart_merges (session_id);
The idempotency_key column on cart_events is the key to safe retries. Every client-side cart operation generates a UUID that travels through all retries. If the same key arrives twice (network retry after a 500 error), the UNIQUE constraint prevents the event from being inserted twice. The application catches the uniqueness violation and returns a 200 OK with the existing event - the operation is already committed.
Amazon uses a similar event sourcing pattern for their shopping cart service, described in their 2007 Dynamo paper. Cart items are modeled as add/remove events rather than mutable state, and the current cart is derived by replaying events. This approach naturally handles eventual consistency in a multi-region deployment because events are commutative - merging event logs from two replicas always produces the same result regardless of order.
Merge Strategy on Login
The merge operation is the hardest part of this system. It runs exactly once per login event - but in a distributed system “exactly once” is notoriously difficult. We design for at-least-once delivery with idempotency to achieve the effect of exactly-once.
Think of the merge like combining two grocery lists before a shopping trip. If your list has “apples x2” and your partner’s list has “apples x3”, do you want 2, 3, or 5 apples? The answer depends on business intent. For most e-commerce sites, the right answer is take the maximum quantity - assume the user intended to buy the larger amount, and let them reduce in the cart if needed. Amazon uses this strategy. Some retailers prefer union with sum (add quantities together). We make this configurable per merchant.
The merge algorithm runs in a Postgres transaction for atomicity:
# Cart merge algorithm - runs inside a Postgres transaction
# Handles: empty guest cart, empty user cart, overlapping SKUs, and concurrent merge attempts
import json
import uuid
from datetime import datetime, timezone
from typing import Optional
import psycopg2
from psycopg2.extras import DictCursor
import redis
MERGE_STRATEGY_MAX_QTY = "max" # take the larger quantity for conflicts
MERGE_STRATEGY_SUM_QTY = "sum" # add quantities together
MERGE_STRATEGY_PREFER_AUTH = "auth" # keep authenticated cart as-is, ignore guest
def merge_cart(
pg_conn,
redis_client: redis.Redis,
user_id: str,
session_id: str,
strategy: str = MERGE_STRATEGY_MAX_QTY,
idempotency_key: Optional[str] = None,
) -> dict:
"""
Atomically merge guest cart (Redis) into authenticated cart (Postgres).
Returns a summary of items merged and conflicts resolved.
"""
if not idempotency_key:
idempotency_key = str(uuid.uuid4())
guest_key = f"guest:{session_id}:cart"
with pg_conn.cursor(cursor_factory=DictCursor) as cur:
# Check if this merge was already completed (idempotency)
cur.execute(
"""
SELECT merge_status, items_merged, conflict_count
FROM guest_cart_merges
WHERE session_id = %s AND user_id = %s
AND merge_status = 'COMPLETED'
LIMIT 1
""",
(session_id, user_id),
)
existing = cur.fetchone()
if existing:
return {
"status": "already_merged",
"items_merged": existing["items_merged"],
"conflicts": existing["conflict_count"],
}
# Lock the merge record to prevent concurrent merge attempts for same session
cur.execute(
"""
INSERT INTO guest_cart_merges (user_id, session_id, merge_status)
VALUES (%s, %s, 'PENDING')
ON CONFLICT DO NOTHING
RETURNING id
""",
(user_id, session_id),
)
merge_row = cur.fetchone()
if not merge_row:
# Another process grabbed the lock - wait and return
return {"status": "merge_in_progress"}
merge_id = merge_row["id"]
# Fetch guest cart from Redis (outside the Postgres transaction is fine -
# we hold the merge lock so no other process will touch this session_id)
guest_items_raw = redis_client.hgetall(guest_key)
if not guest_items_raw:
# Guest cart is empty or already expired - nothing to merge
cur.execute(
"UPDATE guest_cart_merges SET merge_status='SKIPPED', completed_at=NOW() WHERE id=%s",
(merge_id,),
)
pg_conn.commit()
return {"status": "guest_cart_empty", "items_merged": 0}
guest_items = {
k.decode(): json.loads(v) for k, v in guest_items_raw.items()
}
# Fetch current authenticated cart snapshot
cur.execute(
"SELECT items, last_event_id FROM cart_snapshots WHERE user_id = %s FOR UPDATE",
(user_id,),
)
auth_row = cur.fetchone()
auth_items = auth_row["items"] if auth_row else {}
last_event_id = auth_row["last_event_id"] if auth_row else 0
# Compute merged cart
merged_items = dict(auth_items) # start with authenticated cart
items_added = 0
conflicts = 0
for sku, guest_item in guest_items.items():
if sku not in merged_items:
# SKU not in auth cart - simply add it
merged_items[sku] = guest_item
items_added += 1
else:
# SKU exists in both carts - apply conflict resolution strategy
conflicts += 1
auth_qty = merged_items[sku]["qty"]
guest_qty = guest_item["qty"]
if strategy == MERGE_STRATEGY_MAX_QTY:
merged_items[sku]["qty"] = max(auth_qty, guest_qty)
elif strategy == MERGE_STRATEGY_SUM_QTY:
merged_items[sku]["qty"] = auth_qty + guest_qty
elif strategy == MERGE_STRATEGY_PREFER_AUTH:
pass # keep auth cart as-is
# Always keep the earlier added_at to preserve order history
if guest_item.get("added_at", "") < merged_items[sku].get("added_at", ""):
merged_items[sku]["added_at"] = guest_item["added_at"]
# Write CART_MERGED event to event log
event_metadata = {
"merge_source_session_id": session_id,
"strategy": strategy,
"items_added": items_added,
"conflicts_resolved": conflicts,
}
cur.execute(
"""
INSERT INTO cart_events
(user_id, event_type, idempotency_key, metadata, created_at)
VALUES (%s, 'CART_MERGED', %s, %s, NOW())
RETURNING id
""",
(user_id, idempotency_key, json.dumps(event_metadata)),
)
new_event_id = cur.fetchone()[0]
# Upsert cart snapshot with merged state
cur.execute(
"""
INSERT INTO cart_snapshots (user_id, items, last_event_id, item_count, updated_at)
VALUES (%s, %s, %s, %s, NOW())
ON CONFLICT (user_id) DO UPDATE SET
items = EXCLUDED.items,
last_event_id = EXCLUDED.last_event_id,
item_count = EXCLUDED.item_count,
updated_at = EXCLUDED.updated_at
""",
(user_id, json.dumps(merged_items), new_event_id, len(merged_items)),
)
# Mark merge as completed
cur.execute(
"""
UPDATE guest_cart_merges
SET merge_status='COMPLETED', items_merged=%s,
conflict_count=%s, completed_at=NOW()
WHERE id=%s
""",
(items_added, conflicts, merge_id),
)
pg_conn.commit()
# Delete guest cart from Redis after successful Postgres commit
# If this Redis DEL fails, the next merge attempt sees the completed record and skips
redis_client.delete(guest_key)
return {
"status": "merged",
"items_merged": items_added,
"conflicts_resolved": conflicts,
"total_items": len(merged_items),
}
Notice the ordering: we commit to Postgres first, then delete from Redis. If the process crashes between the commit and the Redis DELETE, the next merge attempt will find the COMPLETED record in guest_cart_merges and skip. The guest cart in Redis will eventually expire via TTL. This is the correct choice - we can tolerate a stale guest cart in Redis for 30 days, but we cannot tolerate losing the merged authenticated cart.
The merge idempotency lock lives in Postgres alongside the cart data. A Redis-based lock would create a distributed transaction across two stores - if Redis returns the lock but Postgres fails, you have a lock with no cart. Keep the lock and the data in the same store.
Do not run the merge inside the login HTTP request handler with a 5-second timeout. If the guest cart has 50 items and Postgres is under load, the merge can take 300ms - making login feel slow. Always trigger merge via an async event (Kafka or a background job queue) and return the login response immediately. Show the merged cart on the next page load.
Distributed Cart State and Concurrency
A user logged into both their laptop and phone adds different items simultaneously. Both devices hold a stale cart snapshot. When device A writes, it must not silently overwrite device B’s concurrent write. This is the classic lost update problem, and it occurs in any system where reads and writes are not serialized.
The standard solution for single-node Postgres is SELECT FOR UPDATE. For a distributed deployment where multiple Cart Service instances handle requests, we use optimistic concurrency control with version vectors. Each cart snapshot carries a version integer. Every write includes the version the client last read. If the server-side version has advanced (another device wrote), the write is rejected with a 409 Conflict.
# Optimistic concurrency control for concurrent device writes
# Returns 409 if another device updated the cart since last read
def add_item_to_authenticated_cart(
pg_conn,
user_id: str,
sku_id: str,
quantity: int,
price_cents: int,
client_version: int,
idempotency_key: str,
) -> dict:
"""
Add item to authenticated cart with optimistic concurrency control.
client_version is the snapshot version the client last read.
"""
with pg_conn.cursor(cursor_factory=DictCursor) as cur:
# Check idempotency first - retry of a committed operation
cur.execute(
"SELECT id FROM cart_events WHERE idempotency_key = %s",
(idempotency_key,),
)
if cur.fetchone():
return {"status": "ok", "idempotent": True}
# Read current snapshot with row lock to prevent lost updates
cur.execute(
"SELECT items, last_event_id, version FROM cart_snapshots WHERE user_id = %s FOR UPDATE NOWAIT",
(user_id,),
)
row = cur.fetchone()
if row is None:
# No cart yet - client_version should be 0
if client_version != 0:
return {"status": "conflict", "server_version": 0}
current_items = {}
current_version = 0
else:
current_items = row["items"]
current_version = row["version"]
if client_version != current_version:
# Cart was modified by another device since client last read
return {
"status": "conflict",
"server_version": current_version,
"message": "Cart modified concurrently, please refresh",
}
# Apply the item add
if sku_id in current_items:
current_items[sku_id]["qty"] += quantity
else:
current_items[sku_id] = {
"sku": sku_id,
"qty": quantity,
"price_cents": price_cents,
"added_at": datetime.now(timezone.utc).isoformat(),
}
new_version = current_version + 1
# Insert event
cur.execute(
"""
INSERT INTO cart_events
(user_id, event_type, sku_id, quantity_delta, price_cents, idempotency_key, created_at)
VALUES (%s, 'ITEM_ADDED', %s, %s, %s, %s, NOW())
RETURNING id
""",
(user_id, sku_id, quantity, price_cents, idempotency_key),
)
event_id = cur.fetchone()[0]
# Update snapshot with new version
cur.execute(
"""
INSERT INTO cart_snapshots (user_id, items, last_event_id, item_count, version, updated_at)
VALUES (%s, %s, %s, %s, %s, NOW())
ON CONFLICT (user_id) DO UPDATE SET
items = EXCLUDED.items,
last_event_id = EXCLUDED.last_event_id,
item_count = EXCLUDED.item_count,
version = EXCLUDED.version,
updated_at = EXCLUDED.updated_at
""",
(user_id, json.dumps(current_items), event_id, len(current_items), new_version),
)
pg_conn.commit()
return {
"status": "ok",
"new_version": new_version,
"item_count": len(current_items),
}
When a client receives a 409 Conflict, it fetches the current cart, shows the user the updated state, and lets them retry the operation. This is the same pattern used by GitHub’s API for concurrent pull request updates.
Shopify’s cart API uses a similar version-based conflict detection pattern. Their cart has a token (opaque version identifier) returned on every read. A write that includes a stale token is rejected. The client is expected to merge the server state with the user’s intent before retrying. This prevents the “ghost item” bug where two browser tabs fight over cart state and items appear and disappear randomly.
Data Model
The full data model spans three stores: Redis for guest carts, Postgres for authenticated carts and event log, and a secondary read replica for analytics queries.
-- Full schema with indexes for distributed shopping cart system
-- cart_snapshots: current materialized cart state per user
CREATE TABLE cart_snapshots (
user_id UUID PRIMARY KEY,
items JSONB NOT NULL DEFAULT '{}',
last_event_id BIGINT NOT NULL DEFAULT 0,
item_count INT NOT NULL DEFAULT 0,
version INT NOT NULL DEFAULT 0,
total_value_cents BIGINT GENERATED ALWAYS AS (
-- Computed total from items JSONB - for quick display
(SELECT COALESCE(SUM((value->>'price_cents')::int * (value->>'qty')::int), 0)
FROM jsonb_each(items) AS t(key, value))
) STORED,
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- GIN index for JSONB item queries (e.g., "find all carts containing SKU-9001")
CREATE INDEX idx_snapshots_items_gin ON cart_snapshots USING GIN (items);
CREATE INDEX idx_snapshots_updated ON cart_snapshots (updated_at DESC)
WHERE item_count > 0;
-- cart_events: append-only event log for sourcing and audit
CREATE TABLE cart_events (
id BIGSERIAL PRIMARY KEY,
user_id UUID NOT NULL,
event_type VARCHAR(30) NOT NULL
CHECK (event_type IN (
'ITEM_ADDED', 'ITEM_REMOVED', 'QUANTITY_UPDATED',
'CART_MERGED', 'CART_CLEARED', 'CART_CHECKED_OUT'
)),
sku_id VARCHAR(64),
quantity_delta INT,
price_cents INT,
idempotency_key VARCHAR(128),
metadata JSONB,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
) PARTITION BY RANGE (created_at);
-- Monthly partitions for time-series event retention
CREATE TABLE cart_events_2026_06 PARTITION OF cart_events
FOR VALUES FROM ('2026-06-01') TO ('2026-07-01');
CREATE TABLE cart_events_2026_07 PARTITION OF cart_events
FOR VALUES FROM ('2026-07-01') TO ('2026-08-01');
CREATE INDEX idx_events_user_time ON cart_events (user_id, created_at DESC);
CREATE UNIQUE INDEX idx_events_idempotency ON cart_events (idempotency_key)
WHERE idempotency_key IS NOT NULL;
-- guest_cart_merges: idempotency and audit for merge operations
CREATE TABLE guest_cart_merges (
id BIGSERIAL PRIMARY KEY,
user_id UUID NOT NULL,
session_id VARCHAR(128) NOT NULL,
merge_status VARCHAR(20) NOT NULL DEFAULT 'PENDING'
CHECK (merge_status IN ('PENDING', 'COMPLETED', 'FAILED', 'SKIPPED')),
items_merged INT DEFAULT 0,
conflict_count INT DEFAULT 0,
merge_strategy VARCHAR(20) DEFAULT 'max',
started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
completed_at TIMESTAMPTZ,
error_message TEXT,
UNIQUE (user_id, session_id)
);
CREATE INDEX idx_merges_user ON guest_cart_merges (user_id, merge_status);
CREATE INDEX idx_merges_pending ON guest_cart_merges (started_at)
WHERE merge_status = 'PENDING';
Sharding strategy: The cart_snapshots and cart_events tables are sharded by user_id using consistent hashing. With 10 Postgres shards and 200 million carts, each shard holds 20 million carts - manageable for a single Postgres instance. The shard for a given user_id is determined by shard_id = hash(user_id) % NUM_SHARDS. All events for a user land on the same shard as their snapshot, eliminating cross-shard transactions.
Guest carts in Redis are distributed using Redis Cluster with 16,384 hash slots. The key pattern guest:{session_id}:cart uses the session_id portion as the hash key, giving even distribution across nodes.
Partitioning cart_events by time (monthly partitions) rather than by user_id makes old event data cheap to drop: dropping a month’s partition is an O(1) metadata operation in Postgres, whereas deleting individual rows from an unpartitioned table at scale causes table bloat and write amplification from vacuuming.
Key Algorithms and Protocols
Consistent Hash Ring for Shard Routing
The Cart Service must route reads and writes to the correct Postgres shard without a central coordinator. A consistent hash ring ensures minimal key remapping when shards are added or removed during capacity changes.
# Consistent hash ring for cart shard routing
# Uses virtual nodes (vnodes) to ensure even distribution
import hashlib
from bisect import bisect_left, insort
from typing import Optional
class ConsistentHashRing:
"""
Maps user_id to a Postgres shard using consistent hashing with virtual nodes.
Virtual nodes (vnodes_per_node=150) ensure ~1% rebalance overhead per added shard.
Time complexity: O(log N) for lookup where N = total vnodes
Space complexity: O(N * vnodes_per_node)
"""
def __init__(self, vnodes_per_node: int = 150):
self.vnodes_per_node = vnodes_per_node
self._ring: list[int] = [] # sorted list of hash positions
self._nodes: dict[int, str] = {} # hash position -> shard name
def add_shard(self, shard_name: str) -> None:
"""Add a shard to the ring with virtual nodes."""
for i in range(self.vnodes_per_node):
vnode_key = f"{shard_name}:vnode:{i}".encode()
position = int(hashlib.md5(vnode_key).hexdigest(), 16)
insort(self._ring, position)
self._nodes[position] = shard_name
def remove_shard(self, shard_name: str) -> None:
"""Remove a shard and its virtual nodes from the ring."""
for i in range(self.vnodes_per_node):
vnode_key = f"{shard_name}:vnode:{i}".encode()
position = int(hashlib.md5(vnode_key).hexdigest(), 16)
self._ring.remove(position)
del self._nodes[position]
def get_shard(self, user_id: str) -> Optional[str]:
"""
Find the shard responsible for a given user_id.
Returns None if no shards are registered.
"""
if not self._ring:
return None
key_hash = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
# Find the first vnode position >= key_hash (clockwise on the ring)
idx = bisect_left(self._ring, key_hash)
if idx == len(self._ring):
idx = 0 # wrap around to the first vnode
return self._nodes[self._ring[idx]]
# Initialize ring with 10 shards
ring = ConsistentHashRing(vnodes_per_node=150)
for i in range(10):
ring.add_shard(f"cart-pg-shard-{i:02d}")
# Route a user's cart operation to the correct shard
shard = ring.get_shard("user-uuid-abc-123")
# Returns: "cart-pg-shard-04" (or whichever shard owns this hash)
TTL-Based Guest Cart Expiry and Cleanup
Redis handles TTL automatically, but we need a background sweep to clean up the guest_cart_merges records for expired sessions so the table does not grow unbounded.
# Background cleanup job for expired guest cart merge records
# Runs every hour via cron or a scheduled task
import psycopg2
from datetime import datetime, timedelta, timezone
def cleanup_expired_guest_carts(pg_conn, redis_client, max_age_days: int = 31):
"""
Delete guest_cart_merges records for sessions older than max_age_days.
Redis TTL handles the actual guest cart data - this cleans up Postgres metadata.
Also sweeps for PENDING merges that never completed (crash recovery).
"""
cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
stale_cutoff = datetime.now(timezone.utc) - timedelta(minutes=30)
with pg_conn.cursor() as cur:
# Clean up completed/skipped merges older than retention window
cur.execute(
"""
DELETE FROM guest_cart_merges
WHERE merge_status IN ('COMPLETED', 'SKIPPED', 'FAILED')
AND started_at < %s
""",
(cutoff,),
)
deleted = cur.rowcount
# Find PENDING merges stuck for more than 30 minutes (process crash)
cur.execute(
"""
SELECT id, user_id, session_id
FROM guest_cart_merges
WHERE merge_status = 'PENDING' AND started_at < %s
""",
(stale_cutoff,),
)
stale_pending = cur.fetchall()
# Re-queue stale PENDING merges for retry
requeued = 0
for row in stale_pending:
# Check if guest cart still exists in Redis
guest_key = f"guest:{row[2]}:cart"
if redis_client.exists(guest_key):
# Reset to PENDING so the merge service picks it up again
cur.execute(
"UPDATE guest_cart_merges SET merge_status='PENDING', started_at=NOW() WHERE id=%s",
(row[0],),
)
requeued += 1
else:
# Guest cart already expired - mark as SKIPPED
cur.execute(
"UPDATE guest_cart_merges SET merge_status='SKIPPED', completed_at=NOW() WHERE id=%s",
(row[0],),
)
pg_conn.commit()
return {"deleted": deleted, "requeued": requeued}
Cart Event Sourcing Replay
If a cart_snapshots row is corrupted or accidentally deleted, we can reconstruct the current cart state by replaying events from the event log. This is the core property that makes event sourcing valuable.
# Reconstruct cart snapshot from event log (used for recovery and debugging)
def rebuild_cart_from_events(pg_conn, user_id: str) -> dict:
"""
Replay all cart events for a user to reconstruct current cart state.
Handles ITEM_ADDED, ITEM_REMOVED, QUANTITY_UPDATED, CART_CLEARED, CART_MERGED.
Time complexity: O(E) where E is number of events for this user.
"""
with pg_conn.cursor(cursor_factory=DictCursor) as cur:
cur.execute(
"""
SELECT event_type, sku_id, quantity_delta, price_cents, metadata, created_at
FROM cart_events
WHERE user_id = %s
ORDER BY id ASC
""",
(user_id,),
)
events = cur.fetchall()
cart = {} # sku_id -> {qty, price_cents, added_at}
for event in events:
etype = event["event_type"]
sku = event["sku_id"]
if etype == "ITEM_ADDED":
if sku in cart:
cart[sku]["qty"] += event["quantity_delta"]
else:
cart[sku] = {
"sku": sku,
"qty": event["quantity_delta"],
"price_cents": event["price_cents"],
"added_at": event["created_at"].isoformat(),
}
elif etype == "ITEM_REMOVED":
if sku in cart:
cart[sku]["qty"] += event["quantity_delta"] # delta is negative
if cart[sku]["qty"] <= 0:
del cart[sku]
elif etype == "QUANTITY_UPDATED":
if sku in cart:
cart[sku]["qty"] = event["quantity_delta"] # absolute, not delta
elif etype == "CART_CLEARED" or etype == "CART_CHECKED_OUT":
cart = {} # clear all items
elif etype == "CART_MERGED":
# Merge events encode the merged state in metadata
meta = event["metadata"] or {}
# Items are replayed from the individual ITEM_ADDED events that follow;
# CART_MERGED is a marker event, not a full state snapshot
return cart
Event sourcing makes the merge operation auditable. When a customer calls support claiming “my items disappeared after login,” you can replay their event log, identify which CART_MERGED event ran, and show exactly which items were in each cart and which conflict resolution strategy was applied. Without the event log, this debugging is impossible.
Scaling and Performance
Back-of-Envelope Capacity Estimation - Black Friday Peak
Given:
- 50,000 cart operations per second (add/update/remove)
- 500,000 cart reads per second (product pages, cart icon badge)
- Average cart size: 8 items at ~500 bytes each = 4KB per cart
- 200M guest carts + 80M authenticated carts = 280M total
Read traffic:
- 500K reads/sec × 4KB avg response = 2 GB/s read bandwidth
- Redis can serve 100K+ ops/sec per node at <1ms p99
- Need: 5 Redis read replicas to handle 500K reads/sec at safe headroom
Write traffic:
- 50K writes/sec to Redis (guest cart ops)
- 20K writes/sec to Postgres (authenticated cart ops, ~40% of traffic is auth)
- Postgres: single primary handles ~10K TPS; need sharding
- With 10 shards: 2K TPS per shard (well within limits)
Storage:
- Redis (guest carts): 200M × 4KB = 800 GB - fits on 3 Redis nodes (300GB each) with replication
- Postgres (auth carts + events): 80M × 4KB snapshots = 320 GB
- Event log: 50K events/sec × 200 bytes/event × 86400 sec = 864 GB/day
- Event log retention: 90 days rolling window = ~78 TB total (use S3 archival after 30 days)
Merge traffic:
- Peak login rate during flash sale: ~5,000 logins/sec
- Each merge: 1 Redis HGETALL + 1 Postgres transaction ≈ 15ms
- 5,000 merges/sec × 15ms = peak merge concurrency ≈ 75 in-flight merges
- Easily handled by 5 Session Merge Service instances
Cache hit ratio:
- Cart reads are near-100% Redis cache hits for guest carts
- Authenticated cart: 95% Redis L1 cache hit (hot carts in Redis, cold in Postgres)
- Redis cache for auth carts: 80M × 4KB = 320 GB; use Redis with LRU eviction
The dominant bottleneck is Postgres write throughput during peak authenticated cart operations. We address this with three techniques:
Write batching: Instead of committing one event per HTTP request, the Cart Service buffers events in a local queue and flushes in batches of 100 every 10ms. This reduces Postgres round-trips by 100x at peak load. The tradeoff is a 10ms additional write latency, which is invisible to users but meaningfully reduces database pressure.
Read-through Redis cache for auth carts: On every authenticated cart read, we check Redis first. Cache miss rate is under 5% at steady state because active users repeatedly check their cart. We use a 24-hour TTL on Redis auth cart cache entries and invalidate on every write. This absorbs 95% of read traffic before it reaches Postgres.
Connection pooling via PgBouncer: At 50K RPS across 10 shards, naive connection-per-request would require 5,000 Postgres connections. PgBouncer in transaction mode limits this to 100 connections per shard, multiplexing 500 application connections through each pool.
Etsy handles a similar challenge with their cart system. During peak holiday traffic (Cyber Monday), their cart service routes 90% of reads from Redis and batches Postgres writes in 50ms windows. They published that this pattern reduced their cart database load by 80% during peak compared to direct Postgres reads. The key lesson: treat the cart as a read-heavy workload even though it feels like a write-heavy one, because every product page renders the cart item count badge.
Failure Modes and Recovery
| Failure | Detection | Impact | Recovery |
|---|---|---|---|
| Redis primary node crash | Redis Sentinel heartbeat miss within 30s | Guest cart reads return empty; writes fail | Sentinel promotes replica automatically; application falls back to Postgres for auth carts; guest cart ops return “cart temporarily unavailable” |
| Postgres shard crash | Health check endpoint fails; connection pool exhausted | Authenticated cart reads/writes for ~10% of users fail | Automatic failover to read replica promoted to primary; 30-60s downtime on affected shard |
| Network partition between Cart Service and Redis | Connection timeout after 100ms | All guest cart operations fail; merge blocked | Circuit breaker opens; return cached cart from Postgres if available; queue writes for replay when Redis recovers |
| Session Merge Service crash mid-merge | PENDING record in guest_cart_merges not updated for 30 minutes | Guest cart merge incomplete; user sees pre-merge cart | Cleanup job resets PENDING to retry; next login or scheduled retry completes the merge |
| Duplicate login events from auth system | guest_cart_merges unique constraint violation on (user_id, session_id) | Second merge attempt sees COMPLETED record and returns early | No action needed; idempotency constraint handles this transparently |
| Redis TTL bug: guest cart expires before merge | HGETALL returns empty after Redis Cluster failover resets TTL tracking | User loses guest cart items | Log event; show user “cart may have been lost” message; do not retry merge (no data to recover) |
The most common operational mistake is configuring Redis with maxmemory-policy allkeys-lru without accounting for guest cart TTLs. When Redis hits max memory, LRU eviction silently deletes recently-touched carts that happen to have been accessed less frequently than others. Use volatile-lru (evict only keys with TTLs set) so carts without TTLs are never evicted, and always set a TTL on every guest cart key.
Comparison of Approaches
| Approach | Cart Storage | Merge Strategy | Conflict Resolution | Best For |
|---|---|---|---|---|
| Session-affinity (sticky sessions) | Application server memory | N/A - cart lives on one server | Not applicable | Single-server deployments, under 1K RPS |
| Client-side cart (localStorage/cookie) | Browser storage | Merge on login via API call | Client sends full cart; server union | Fully serverless/JAMstack sites; no auth required |
| Redis-only (all carts) | Redis Hash per user/session | Lua script merge | Max-qty or sum-qty via Lua | High-speed e-commerce, eventual consistency acceptable |
| Postgres-only (all carts) | Rows in cart_items table | SQL transaction UPDATE | Database-enforced serialization | Financial accuracy required; lower throughput tolerance |
| Redis (guest) + Postgres (auth) - this design | Split by identity | Background async merge via Kafka | Configurable per merchant, persisted in events | Production e-commerce: high throughput + durability |
| CRDT-based distributed cart | Distributed CRDT nodes | Automatic CRDT merge (no coordination) | CRDT semantics (add-wins) | Multi-region active-active; very high geo-distributed traffic |
The Redis + Postgres split is the right production choice for most e-commerce platforms. Pure Redis sacrifices durability for authenticated carts - a Redis cluster outage during checkout can lose cart state that represents real customer intent. Pure Postgres cannot serve 500K read/sec at p99 under 20ms without a massive (and expensive) read replica fleet. The CRDT approach is elegant but adds operational complexity (CRDT libraries are not mature in all languages, and debugging CRDT merges in production is nontrivial). The split design gives you Redis’s speed for the 70-80% guest traffic and Postgres’s transactional guarantees for the critical merge and checkout path.
Key Takeaways
- Separate storage tiers by lifecycle: Guest carts are ephemeral and read-heavy - Redis is ideal. Authenticated carts require ACID merge transactions - Postgres is correct. Forcing both onto the same system optimizes neither.
- Cart event sourcing enables recovery and auditability: Appending every cart mutation as an event lets you replay, audit, and debug cart state without relying solely on the mutable snapshot. This pays off the first time a customer escalates a “my items disappeared” ticket.
- Merge idempotency must be enforced at the data layer: Using a
UNIQUEconstraint on(user_id, session_id)inguest_cart_mergesprevents double-merges regardless of how many times the merge service retries. Application-level deduplication is insufficient under concurrent retries. - TTL for guest carts is a business decision, not a technical one: 30 days is the standard, but some retailers use 7 days (lower storage cost) or 90 days (higher conversion from returning visitors). The expiry policy should be configurable per merchant, not hardcoded.
- Optimistic concurrency prevents lost updates across devices: Version-based conflict detection on cart snapshots ensures a phone and laptop writing concurrently produce a deterministic result - the last committed write wins and the other receives a 409 with the current state.
- The merge order matters: Commit to Postgres before deleting from Redis. If you delete from Redis first and the Postgres commit fails, you have lost the guest cart with no recovery path. If Postgres commits first and Redis delete fails, the 30-day TTL cleans it up.
- Back-of-envelope math drives architecture: 500K reads/sec × 4KB = 2 GB/s bandwidth. This number alone tells you that Postgres as the read path is impossible without a massive cache layer. Start with capacity estimation before choosing storage technology.
- Conflict resolution strategy is domain-specific: “Take the maximum quantity” is correct for most e-commerce carts. But for a grocery delivery app where items have weight limits, summing quantities can trigger over-limit errors at checkout. Design the merge strategy as a first-class policy, not a hardcoded heuristic.
The counter-intuitive lesson from this design is that the merge operation - the thing that seems hardest - is actually straightforward once you commit to making the guest and authenticated cart stores independent. The real complexity is in ensuring the merge is idempotent, auditable, and crash-safe across two different storage systems. Most teams underestimate this and build the merge as a simple read-update-delete, then spend months chasing “items disappeared on login” bugs in production.
Frequently Asked Questions
Q: Why not store all carts (guest and authenticated) in Redis for simplicity?
A: Redis gives you speed but not the transactional guarantees needed for the merge. The merge operation requires reading two carts, computing a union, writing the result, and deleting the guest cart - all atomically. Redis MULTI/EXEC is optimistic and can conflict under load; the retry logic becomes complex. More critically, authenticated cart data has real monetary value. An unexpected Redis outage or misconfigured maxmemory-policy causing evictions will delete cart state that a user is about to check out. Postgres’s write-ahead log and replication give you durability guarantees that Redis’s AOF cannot match for strong consistency requirements.
Q: Why not use a single unified database (say, DynamoDB) for all cart state?
A: DynamoDB handles the Redis role well (fast key-value reads with TTL), but its transaction model (DynamoDB Transactions) has a limit of 100 items per transaction, a 4MB payload limit, and additional cost per transactional read/write. More importantly, DynamoDB transactions do not give you the row-level locking semantics needed to prevent the lost update problem on concurrent device writes. Postgres’s SELECT FOR UPDATE gives deterministic serialization without the complexity of client-managed transaction retry logic.
Q: What happens if a user has both a guest cart and an authenticated cart, and they add an item on two devices simultaneously during the login event?
A: The merge service holds a row-level lock on the guest_cart_merges record for this (user_id, session_id) pair. Concurrent device writes to the authenticated cart during the merge are serialized by the SELECT FOR UPDATE on cart_snapshots. The merge writes a new snapshot with a new version number; the concurrent device write, if it was reading a stale version, receives a 409 Conflict and must retry with the post-merge snapshot. The guest cart items are never lost in this scenario - they are committed to Postgres before the merge returns.
Q: How do you handle a guest user who signs in on multiple devices before the first merge completes?
A: Each device’s login event carries the same session_id (from the cookie). The UNIQUE (user_id, session_id) constraint on guest_cart_merges ensures that only one merge process can hold the lock at a time. Subsequent login events for the same session see the COMPLETED or IN_PROGRESS record and return early. The second device gets the merged cart on its next cart fetch (which reads from cart_snapshots).
Q: Why use an async background merge instead of merging synchronously during login?
A: Synchronous merge during login creates a coupling between the auth service latency and the cart service latency. On a Black Friday where 5,000 users/sec are logging in, a cart service slowdown (Postgres under load, network hiccup) would cascade into login API timeouts. Users would see slow logins with no clear error. Async merge decouples these concerns: login completes in under 100ms, and the cart shows merged state on the next page load (200-500ms later). The user experience is identical from the user’s perspective, and the system is more resilient.
Q: Why does the merge strategy default to “take the maximum quantity” rather than summing quantities?
A: Summing creates a counter-intuitive user experience. If a user has 2 units of a phone case in their authenticated cart (added last week) and then adds 1 unit as a guest today (thinking they replaced the older cart), a sum merge gives them 3 units - but they only want 1 or 2. The maximum strategy is more conservative and matches user intent in the common case: the user wanted “at least N” of an item, and both carts represent different moments of that intent. Summing is appropriate for grocery/consumables use cases where adding the same item twice genuinely means wanting more.
Interview Questions
Q: Walk me through the full lifecycle of a shopping cart from guest browse to authenticated checkout.
Expected depth: Trace from anonymous cookie issuance and Redis HSET operations through fingerprint-based session_id assignment, TTL configuration, login event triggering the Kafka merge event, the async Session Merge Service reading from both Redis and Postgres, conflict resolution strategy selection, the idempotency lock on guest_cart_merges, the atomic Postgres transaction writing the merged cart_snapshots row, Redis cache invalidation, and finally the checkout service reading the finalized cart.
Q: How would you handle the case where two users share a device (a family computer) and each adds items as a guest before logging into their own accounts?
Expected depth: Each login should only claim the guest cart associated with the session_id present at the time of login. After user A logs in and the merge completes, the session_id cookie is rotated (new anonymous session begins). When user B logs in, they get a fresh empty guest session_id. The key design point is that session_id is rotated on logout and on login, preventing cross-contamination. Discuss the edge case where the browser session is shared without logout, and whether the business policy should merge or discard the incumbent guest cart.
Q: How do you scale the Postgres cart store to handle 80 million users across 20 shards during a flash sale?
Expected depth: Discuss consistent hashing for shard routing (why consistent hashing minimizes remapping on shard addition), PgBouncer transaction-mode connection pooling (why session mode is wrong for stateless cart operations), read replicas for cart reads with a cache-aside Redis layer in front, write batching via event buffering to reduce Postgres TPS, Postgres partitioned tables for cart_events with monthly range partitions enabling fast historical data archival to S3 (Parquet via pg_dump or logical replication to Redshift), and the capacity math showing why 10 shards at 2K TPS each is safe.
Q: The merge service crashes mid-operation for 1,000 users during a peak login event. How do you recover?
Expected depth: The guest_cart_merges table holds PENDING records for each in-progress merge. The cleanup job runs every 30 minutes and finds PENDING records older than 30 minutes. For each, it checks whether the guest cart Redis key still exists. If yes, it resets the record to PENDING (triggering retry by the merge service). If no (TTL expired), it marks SKIPPED. Discuss why you need idempotency on the re-enqueued merges (the merge could have partially committed before the crash), why the Postgres UNIQUE constraint is the safety net, and how you would alert on-call when PENDING record count spikes above a threshold.
Q: A product manager asks: “Can we show the user a preview of what their merged cart will look like before they confirm the login?” How would you design this?
Expected depth: The merge preview requires a dry-run of the merge algorithm without committing to Postgres. Add a dry_run=True parameter to the merge function that runs the same conflict resolution logic but returns the computed merged_items dict without writing to the database, emitting Kafka events, or acquiring the idempotency lock. This lets the login page display “You have 3 items from your guest session. Your cart already has 2 of those items. Click merge to combine them.” The preview endpoint must read both the current authenticated cart snapshot and the guest cart - it should be read-only and run against a replica to avoid load on the primary. Discuss the race condition: between preview display and user confirmation, the authenticated cart could change (another device added items). The confirmation step re-runs the full merge with the latest state, so the preview is best-effort, not a guarantee.
Premium Content
Unlock the full article along with everything else in the archive — all in one place.