Build Instagram Stories Expiry and Archival Pipeline
data-engineering scalability databases
System Design Deep Dive
Instagram Stories Expiry Pipeline
Expiring 500 million Stories per day without a sweep that stalls your database
Think of a library where every book self-destructs 24 hours after checkout. That sounds simple until you realize 500 million books are being checked out every day, each on its own countdown clock. When the timer hits zero, you need to: stop serving the book, archive it somewhere cold in case the borrower wants it later, and update every index that referenced it - all without the librarians noticing any slowdown.
That’s the Instagram Stories expiry problem. Stories feel ephemeral to users, but the engineering underneath is anything but. At 500 million Stories posted per day, expiry is not a background task - it is one of the highest-throughput write workloads the platform runs. Every 24 hours, 500 million items need to be removed from the hot serving path, their media moved to archival storage, and their metadata index cleaned up. Miss the window and users see a “story expired” error mid-view. Delete too aggressively and you lose data users wanted to archive.
The naive approach - a cron job that SELECTs all stories where created_at < NOW() - INTERVAL 24 HOURS and deletes them - works at 10,000 stories per day. At 500 million per day it locks tables, spikes read I/O, and competes directly with story serving. You’d essentially run a full table scan every 24 hours on your hottest database, at peak traffic hours, to delete rows you could have tracked at insert time.
The forces in tension are: expiry precision vs. sweep cost, storage cost vs. retrieval latency for archives, and throughput of the deletion pipeline vs. impact on live reads. We need to solve for TTL-based expiry tracking, sweep-based deletion that doesn’t hurt serving, tiered storage migration to cold object stores, and lazy hydration of archived Stories on demand.
Requirements and Constraints
Functional Requirements
- Stories expire exactly 24 hours after creation and become invisible to non-author viewers
- Expired Stories are moved to archival storage and remain accessible to the author
- Authors can view and re-share archived Stories
- Expiry applies to all media: images, videos, and sticker overlays
- The system must handle both automatic expiry and user-initiated deletion
Non-Functional Requirements
- 500 million Stories created per day (~5,800/second average, ~20,000/second at peak)
- Expiry latency SLA: Stories should stop serving within 60 seconds of the 24-hour mark
- Archival pipeline must complete within 2 hours of expiry (eventual, not synchronous)
- Archive retrieval: cold Story loads within 3 seconds for authors
- System must not cause measurable latency regression on live story serving (p99 < 50ms)
- Storage: hot tier holds ~1 billion active stories at any time (2 days of buffer); cold tier holds 3+ years of archives
Constraints
- Story serving is read-heavy: 10:1 read/write ratio on the metadata store
- Media files (video segments, images) are immutable once uploaded - only metadata expires
- We assume Cassandra for metadata, S3-compatible object storage for media
High-Level Architecture
The pipeline has five major components working in concert: an Expiry Index that tracks when each story must expire, a Sweep Worker Fleet that processes expirations in batches, an Archival Pipeline that moves cold media to tiered storage, a Serving Gate that enforces expiry in real time, and a Lazy Hydration Service that restores archived Stories on author request.
Data flows like this: at upload time, the story metadata is written to Cassandra and a TTL record is inserted into the Expiry Index (a time-bucketed queue). Every minute, Sweep Workers pull the bucket for the current minute and mark those stories as expired in the metadata store. A separate Archival Worker picks up expired stories and migrates their media from hot S3 to Glacier-class cold storage. The Serving Gate checks expiry status on every read request; the check is a single cache lookup, not a database read.
The Expiry Index is the architectural lynchpin. It gives us O(1) writes at story creation and O(bucket_size) reads during sweep - no full table scan required. The key insight: we trade a small write overhead at creation time for a massive reduction in sweep complexity.
Pre-indexing expiry at creation time converts the expiry sweep from a full table scan into a targeted queue drain - this is the difference between O(total_stories) and O(stories_expiring_now).
The Expiry Index
The Expiry Index is a time-bucketed queue. Think of it as a wall of pigeonholes, each labeled with a minute-precision timestamp. When a Story is created, we drop its ID into the pigeonhole for created_at + 24 hours. Sweep Workers empty pigeonholes as their timestamps pass.
The bucket granularity is one minute. That means a maximum of 1,440 active buckets at any time. Each bucket holds the story IDs that expire during that minute. The Sweep Workers need to read and drain one bucket per minute - at 500M stories/day that’s ~347,000 story IDs per minute-bucket on average, peaking near ~1.2M during upload spikes.
# Expiry Index: Redis sorted set per time bucket, score = expiry timestamp
import redis
import time
r = redis.Redis()
def index_story_expiry(story_id: str, created_at: float, ttl_seconds: int = 86400):
expiry_ts = created_at + ttl_seconds
# Round down to minute bucket
bucket_key = f"expiry:{int(expiry_ts // 60) * 60}"
r.zadd(bucket_key, {story_id: expiry_ts})
# Set key TTL so buckets self-clean after 48h
r.expire(bucket_key, 172800)
def drain_bucket(bucket_minute_ts: int) -> list[str]:
bucket_key = f"expiry:{bucket_minute_ts}"
# Atomic: fetch all members, delete key
pipe = r.pipeline()
pipe.zrange(bucket_key, 0, -1)
pipe.delete(bucket_key)
results = pipe.execute()
return [sid.decode() for sid in results[0]]
The sorted set score is the exact expiry timestamp; the bucket key is the minute floor. This lets us use ZRANGE to pull all members (all stories expiring in that minute) in one round trip, then delete the bucket atomically with a pipeline.
If you use a single global sorted set keyed by expiry timestamp, ZRANGEBYSCORE queries compete with writes from ongoing uploads. Minute-bucketed keys shard the load and let buckets expire via Redis TTL without a sweep of the index itself.
At 1.2M story IDs per bucket, each ID averaging 16 bytes as a UUID string, the largest buckets hold ~20MB of data in Redis. This is acceptable. However, a single Redis node becomes a bottleneck at this rate. The fix: hash story IDs across 16 Redis shards, keying on story_id % 16. Each shard holds 1/16th of the bucket.
# Sharded expiry index
SHARD_COUNT = 16
def shard_key(story_id: str, bucket_minute_ts: int) -> str:
shard = hash(story_id) % SHARD_COUNT
return f"expiry:{bucket_minute_ts}:shard:{shard}"
def drain_bucket_sharded(bucket_minute_ts: int) -> list[str]:
all_ids = []
for shard in range(SHARD_COUNT):
key = f"expiry:{bucket_minute_ts}:shard:{shard}"
pipe = r.pipeline()
pipe.zrange(key, 0, -1)
pipe.delete(key)
results = pipe.execute()
all_ids.extend([sid.decode() for sid in results[0]])
return all_ids
Cassandra uses a similar time-bucketed approach for its TTL implementation: each row stores an expiry timestamp and a background compaction process sweeps expired cells by partition. The insight of co-locating expiry metadata with the data is the same - no global index scan needed.
The Sweep Worker Fleet
Sweep Workers are stateless consumers that run on a 60-second schedule. Each worker is responsible for a partition of the shards, ensuring no two workers process the same bucket shard simultaneously.
The sweep process for each worker:
- Acquire a distributed lease for
(bucket_minute, shard_id)using a RedisSET NX EXlock - Drain the assigned bucket shard from the Expiry Index
- Mark each story as expired in Cassandra in batches of 500
- Emit an expiry event to Kafka for downstream consumers (archival, notification)
- Release the lease
# Sweep Worker - single shard processing
from cassandra.cluster import Cluster
from kafka import KafkaProducer
import json
cluster = Cluster(['cassandra-host'])
session = cluster.connect('instagram')
producer = KafkaProducer(bootstrap_servers=['kafka:9092'],
value_serializer=lambda v: json.dumps(v).encode())
def mark_expired_batch(story_ids: list[str]):
# Cassandra batch update - mark expired, set archive_status
stmt = session.prepare(
"UPDATE stories SET expired = true, archive_status = 'pending' "
"WHERE story_id = ? IF expired = false"
)
batch_size = 500
for i in range(0, len(story_ids), batch_size):
batch = story_ids[i:i+batch_size]
futures = [session.execute_async(stmt, [sid]) for sid in batch]
for f in futures:
f.result() # wait, then emit event
for sid in batch:
producer.send('story-expired', {'story_id': sid, 'ts': time.time()})
def sweep_shard(bucket_minute_ts: int, shard_id: int):
lock_key = f"sweep-lock:{bucket_minute_ts}:{shard_id}"
acquired = r.set(lock_key, "1", nx=True, ex=300) # 5-min TTL
if not acquired:
return # another worker has this shard
try:
story_ids = drain_bucket_shard(bucket_minute_ts, shard_id)
if story_ids:
mark_expired_batch(story_ids)
finally:
r.delete(lock_key)
The lightweight check - IF expired = false - is a Cassandra LWT (Lightweight Transaction) that prevents double-processing if a worker crashes mid-sweep and another picks up the same IDs. For the archival pipeline, idempotency is handled on the Kafka consumer side using offset checkpointing.
The Cassandra LWT on the expiry update is the idempotency gate for the entire pipeline - it ensures a story can only transition expired=false to expired=true once, even if the sweep worker retries after a crash.
The Archival Pipeline
The Archival Pipeline consumes from the story-expired Kafka topic and orchestrates the move of story media from hot S3 to cold storage (S3 Glacier or equivalent).
# Kafka consumer group config for archival pipeline
group.id: story-archival-v2
auto.offset.reset: earliest
enable.auto.commit: false
max.poll.records: 200
session.timeout.ms: 30000
For each expired story, the archival pipeline:
- Reads story metadata from Cassandra to find all media asset keys
- Issues an S3
COPY_OBJECTto the cold bucket (Glacier Instant Retrieval) - Verifies the copy via ETag comparison
- Updates the metadata store with the new cold storage URI
- Issues
DELETEon the hot S3 key - Updates
archive_status = 'archived'in Cassandra
# Archival Worker - single story
import boto3
s3 = boto3.client('s3')
HOT_BUCKET = 'instagram-stories-hot'
COLD_BUCKET = 'instagram-stories-archive'
STORAGE_CLASS = 'GLACIER_IR' # Instant Retrieval: ms access, 40% cheaper
def archive_story(story_id: str):
row = session.execute(
"SELECT media_keys, user_id FROM stories WHERE story_id = %s", [story_id]
).one()
if not row:
return # already deleted or not found
for media_key in row.media_keys:
# Copy to cold storage
copy_source = {'Bucket': HOT_BUCKET, 'Key': media_key}
s3.copy_object(
CopySource=copy_source,
Bucket=COLD_BUCKET,
Key=media_key,
StorageClass=STORAGE_CLASS,
)
# Verify copy
hot_etag = s3.head_object(Bucket=HOT_BUCKET, Key=media_key)['ETag']
cold_etag = s3.head_object(Bucket=COLD_BUCKET, Key=media_key)['ETag']
assert hot_etag == cold_etag, f"ETag mismatch for {media_key}"
# Delete hot copy only after verified cold copy exists
s3.delete_object(Bucket=HOT_BUCKET, Key=media_key)
# Update metadata
session.execute(
"UPDATE stories SET archive_status = 'archived', "
"cold_storage_prefix = %s WHERE story_id = %s",
[f"s3://{COLD_BUCKET}/", story_id]
)
Delete the hot S3 object only after the cold copy is verified. If you delete first and the cold copy write fails (network partition to Glacier endpoint), the media is gone permanently. Always verify cold ETag before hot delete.
The archival pipeline runs 2 hours behind the expiry sweep by design. This provides a safety window: if a user disputes an expiry (clock skew, timezone edge case), support can restore from hot storage before the archival pipeline gets to it. After the window, recovery requires Glacier retrieval.
S3 Lifecycle Rules provide exactly this pattern natively - you can define a transition rule that moves objects to GLACIER_IR after N days. Instagram’s pipeline builds a layer on top to handle the metadata updates and user-facing archive retrieval that S3 Lifecycle Rules don’t provide.
The Serving Gate
The Serving Gate enforces expiry at read time. It sits between the API layer and the metadata store, providing a sub-millisecond expiry check without hitting Cassandra on every Story view.
The gate is a bloom-filter-backed cache. Expired story IDs are written into a Bloom filter on expiry. On each read request, the Serving Gate checks the Bloom filter before hitting Cassandra:
- Bloom filter says “definitely not expired” - serve from cache/Cassandra
- Bloom filter says “possibly expired” - check Cassandra
expiredflag (rare path) - Cassandra says
expired = true- return 410 Gone
# Serving Gate - expiry check with Bloom filter
from pybloom_live import ScalableBloomFilter
# Rotated every 48 hours, one filter per 24h window
expiry_bloom = ScalableBloomFilter(mode=ScalableBloomFilter.LARGE_SET_GROWTH,
error_rate=0.001)
def is_story_accessible(story_id: str, viewer_user_id: str, author_user_id: str) -> bool:
# Authors can always access their own stories (archived or not)
if viewer_user_id == author_user_id:
return True
# Fast path: Bloom filter negative = definitely not expired
if story_id not in expiry_bloom:
return True
# Slow path: Bloom filter positive = possibly expired, check Cassandra
row = session.execute(
"SELECT expired FROM stories WHERE story_id = %s", [story_id]
).one()
return row is not None and not row.expired
The Bloom filter false positive rate of 0.1% means 1 in 1,000 reads for non-expired stories triggers an unnecessary Cassandra read. Given 10:1 read/write ratio and 500M stories active, this is ~500K extra Cassandra reads per day - acceptable. The false negative rate is zero by Bloom filter design - a story that is expired is always found by the filter.
The Bloom filter fast path is possible only because story expiry is monotonic - a story transitions expired=false to expired=true exactly once and never back. This one-directional state machine makes Bloom filters the perfect expiry gate.
The Lazy Hydration Service
When an author accesses their archived Stories, the Lazy Hydration Service retrieves them from cold storage. The access pattern is rare (few users browse their archive), bursty (browsing sessions retrieve many stories at once), and latency-tolerant (3 second SLA vs 50ms for live stories).
# Lazy Hydration - serve archived story media
def get_archived_story_media(story_id: str, user_id: str) -> dict:
row = session.execute(
"SELECT media_keys, archive_status, cold_storage_prefix, user_id "
"FROM stories WHERE story_id = %s", [story_id]
).one()
if not row or row.user_id != user_id:
raise PermissionError("Not authorized")
if row.archive_status == 'pending':
# Still in hot storage, serve directly
return {'media_keys': row.media_keys, 'source': 'hot'}
if row.archive_status == 'archived':
# Generate presigned URLs for cold storage - valid 1 hour
presigned = []
for key in row.media_keys:
url = s3.generate_presigned_url(
'get_object',
Params={'Bucket': COLD_BUCKET, 'Key': key},
ExpiresIn=3600
)
presigned.append(url)
return {'presigned_urls': presigned, 'source': 'cold'}
raise ValueError(f"Unexpected archive_status: {row.archive_status}")
The cold storage class (Glacier Instant Retrieval) provides millisecond access latency at ~40% of the cost of standard S3. For archived Stories, this is the right tradeoff - authors access archives infrequently, but when they do, they expect reasonable latency.
Data Model
-- Stories metadata - Cassandra CQL
CREATE TABLE stories (
story_id UUID,
user_id BIGINT,
created_at TIMESTAMP,
expires_at TIMESTAMP,
media_keys LIST<TEXT>, -- S3 object keys for all media segments
expired BOOLEAN,
archive_status TEXT, -- null | pending | archived | deleted
cold_storage_prefix TEXT, -- null until archival completes
duration_secs INT,
view_count BIGINT,
PRIMARY KEY (story_id)
) WITH default_time_to_live = 604800 -- 7 days TTL safety net on Cassandra row
AND gc_grace_seconds = 86400;
-- Secondary index for user's story list (separate table for Cassandra efficiency)
CREATE TABLE user_stories (
user_id BIGINT,
created_at TIMESTAMP,
story_id UUID,
expired BOOLEAN,
PRIMARY KEY (user_id, created_at, story_id)
) WITH CLUSTERING ORDER BY (created_at DESC)
AND default_time_to_live = 604800;
-- Index for archive retrieval (separate table for cold reads)
CREATE TABLE user_archived_stories (
user_id BIGINT,
expired_at TIMESTAMP,
story_id UUID,
archive_status TEXT,
PRIMARY KEY (user_id, expired_at, story_id)
) WITH CLUSTERING ORDER BY (expired_at DESC);
The user_stories table is the hot read path - it powers the story ring on a user’s profile. Stories are naturally ordered by created_at DESC. The Cassandra TTL of 7 days acts as a safety net for the sweep pipeline, ensuring rows are eventually cleaned up even if the sweep misses them (e.g., due to extended Kafka consumer lag).
Indexing strategy: the user_id partition key clusters all stories for a user together, enabling efficient “get all stories for user X” queries. created_at as the clustering key gives time-ordered access without a sort. The archive table uses expired_at as clustering key to support “show me stories archived in the last 7 days” queries naturally.
Cassandra’s built-in TTL does not cascade to application-layer side effects - it deletes the row but does not trigger S3 object deletion or archive pipeline events. Always use the sweep pipeline for expiry logic; the Cassandra TTL is a cleanup fallback only.
Key Algorithms and Protocols
Time-Bucket Partitioning
The bucket granularity (1 minute) balances sweep overhead vs. expiry precision. With 500M stories/day and 1,440 minute-buckets, each bucket averages 347K story IDs. Finer granularity (per-second) would create 86,400 buckets but each smaller. The minute bucket is the sweet spot: sweep workers comfortably drain one bucket per minute, and the 60-second expiry tolerance is acceptable (stories expire “within 60 seconds of the 24-hour mark”, not exactly at T+24h).
# Bucket assignment - deterministic, no coordination required
import math
def bucket_for_story(created_at_unix: float, ttl_seconds: int = 86400) -> int:
expiry_exact = created_at_unix + ttl_seconds
# Floor to minute boundary
return math.floor(expiry_exact / 60) * 60
# Example: created at 1748000000.7, expires at 1748086400.7
# bucket = floor(1748086400.7 / 60) * 60 = 1748086380
# All stories expiring in that minute go into bucket 1748086380
Time complexity: O(1) for bucket insertion and lookup. Space: O(stories_active) for total index size.
Storage Lifecycle Policy
The transition between storage tiers follows a state machine:
CREATED -> HOT (active, served)
HOT -> SWEEP_PENDING (expiry index drained, not yet processed)
SWEEP_PENDING -> EXPIRED_HOT (sweep worker processed, media still hot)
EXPIRED_HOT -> ARCHIVING (archival pipeline picked up)
ARCHIVING -> COLD (media in Glacier, presigned URLs served)
COLD -> DELETED (user deleted archive or retention policy hit)
# State transition validation
VALID_TRANSITIONS = {
'active': ['sweep_pending'],
'sweep_pending': ['expired_hot'],
'expired_hot': ['archiving'],
'archiving': ['cold'],
'cold': ['deleted'],
}
def transition_story(story_id: str, new_status: str, current_status: str):
if new_status not in VALID_TRANSITIONS.get(current_status, []):
raise ValueError(
f"Invalid transition: {current_status} -> {new_status} for story {story_id}"
)
The state machine transition validation catches pipeline bugs early - if archival workers try to skip the sweep_pending state, the transition is rejected, preventing silent data loss from a misrouted Kafka message.
Index Cleanup After Expiry
After a story expires, its entry in user_stories must be marked or removed so it stops appearing in the story ring. Rather than deleting rows (Cassandra deletes create tombstones that slow reads), we set expired = true and filter at the application layer:
# Read path - active stories for user (filters expired)
def get_active_stories(user_id: int, limit: int = 50) -> list:
rows = session.execute(
"SELECT story_id, created_at, expired FROM user_stories "
"WHERE user_id = %s AND created_at > %s ORDER BY created_at DESC LIMIT %s",
[user_id, time.time() - 86400 * 2, limit] # last 48h window
)
return [r for r in rows if not r.expired]
The 48h window bounds the tombstone accumulation - Cassandra only sees rows from the last 2 days, limiting the scan range.
Scaling and Performance
Given:
- 500M stories/day = 5,787 stories/second average
- Peak: 3x average = ~17,000 stories/second
- Average story size: 2MB (image) to 20MB (15s video)
- Average media keys per story: 3 (image + 2 video segments)
- Expiry Index Redis entry: 36 bytes per story (UUID + score)
Expiry Index storage:
- Active stories (48h): 1 billion stories * 36 bytes = 36 GB Redis total
- Per shard (16 shards): ~2.25 GB each - fits in a single Redis instance
Archival pipeline throughput:
- 500M stories/day = 5,787/second to archive
- Each archival involves 3 S3 COPY + 3 S3 DELETE = 6 API calls
- 5,787 stories/sec * 6 calls = ~35,000 S3 API calls/sec
- S3 request limit: 3,500 PUT/COPY/DELETE per prefix per second
- Solution: use 20 prefix shards (user_id % 20) = 175,000 req/sec capacity
Media storage migration:
- 500M stories/day * 10MB average = 5 PB/day transferred to cold
- S3 Glacier IR data transfer: included in PUT pricing
- Monthly cold storage: 500M * 10MB * 30 days = 150 PB (with lifecycle expiry)
Sweep worker compute:
- 347,000 story IDs per minute-bucket average
- 16 shards * 1 Redis pipeline = ~22,000 IDs per shard-bucket
- Cassandra batch write: 500 rows/batch = 44 batches per shard
- At 10ms per batch: 440ms per shard = comfortably within 60s sweep window
The dominant bottleneck at scale is the S3 API call rate during archival. The 20 prefix shard strategy distributes S3 API load across prefixes, each of which has its own rate limit. Story media keys are already structured as {user_id % 20}/{story_id}/{filename}, making this transparent to the application layer.
Dropbox’s camera upload and sync pipeline uses the same prefix-sharding approach for S3 to stay within per-prefix rate limits. The key is that the shard prefix is embedded in the object key at upload time, making it permanent and consistent across all operations on that object.
Failure Modes and Recovery
| Failure | Detection | Impact | Recovery |
|---|---|---|---|
| Sweep worker crash mid-bucket | Redis lease TTL expires (5 min), monitoring on sweep lag | Stories in that bucket expire up to 5 min late | New worker acquires lease, re-drains bucket; Cassandra LWT prevents double-processing |
| Redis shard failure | Redis cluster failover alarm | Expiry Index unavailable for affected shard; new stories can’t be indexed | Failover to replica in <30s; stories created during outage indexed on recovery via WAL replay |
| Kafka consumer lag (archival) | Lag > 2h threshold alert | Stories remain in hot storage longer | Scale archival workers; hot storage is not deleted until cold copy verified, so no data loss |
| S3 COPY failure to cold bucket | ETag mismatch or exception | Media stays in hot storage, archive_status stays ‘archiving’ | Dead-letter queue retries; archival worker retries on next Kafka redelivery |
| Cassandra node failure | Nodetool nodesync + health checks | Partial expiry marking failures | Cassandra RF=3 ensures writes succeed at quorum; failed stories re-swept on next worker run |
| Clock skew on story creation | NTP monitoring, p99 latency alerts | Story expires slightly early/late | Expiry precision SLA is 60s, not 1s; NTP drift is typically <100ms, well within tolerance |
The most operationally dangerous mistake is deleting the hot S3 object before confirming the Kafka archival message was processed. If the Kafka topic is misconfigured with a short retention (e.g., 1 hour) and the archival consumer falls behind, messages are lost and media is never archived - it simply disappears.
Comparison of Approaches
| Approach | Expiry Latency | Database Load | Complexity | Best Fit |
|---|---|---|---|---|
| Full table scan (cron) | ~1 min (if fast) | Extreme - full scan per day | Low | <10M stories/day |
| Cassandra built-in TTL | Eventual (GC-based) | Low | Very low | When no archival needed |
| Time-bucketed expiry index (our approach) | <60 seconds guaranteed | Targeted writes only | Medium | 100M+ stories/day with archival |
| Event-driven (expiry timer per story) | Exact to the second | Low on average, spiky | High | When exact expiry SLA required |
| Database expiry field + index scan | Minutes to hours | Moderate (index scan only) | Low | 1M-50M stories/day |
The time-bucketed expiry index wins at our scale because it converts sweep cost from O(total_active_stories) to O(stories_expiring_in_this_bucket). The event-driven approach (one timer per story) would require 500M active timers - manageable in a purpose-built system like AWS EventBridge Scheduler, but operationally complex to debug and replay. The bucket approach is simpler, more observable, and straightforward to replay: re-drain any bucket from Redis or a backup.
Key Takeaways
- Pre-index at creation: Write expiry bucket entries at story creation time to avoid full table scans at expiry time.
- TTL-based expiry tracking: Use time-bucketed queues instead of querying
WHERE created_at < threshold- this separates expiry tracking from data storage. - Sweep-based deletion: Process one time-bucket per minute across a worker fleet to keep per-worker load bounded and predictable.
- Tiered storage: Move expired media to Glacier Instant Retrieval immediately after expiry - the 40% cost reduction compounds at petabyte scale.
- Archival pipeline: Run the archival pipeline asynchronously via Kafka, hours behind the expiry sweep, to decouple expiry precision from archival throughput.
- Lazy hydration: Cold storage retrieval via presigned URLs requires no media migration back to hot storage for the author access pattern.
- Bloom filter serving gate: A Bloom filter provides O(1) expiry checks on the read hot path, reserving Cassandra reads for the rare false-positive case.
- Index cleanup denormalization: Mark
expired = truein user_stories rather than deleting rows, to avoid Cassandra tombstone accumulation on the hot read path.
The counter-intuitive lesson from this system: the hardest part of expiry at scale is not deleting data - it’s ensuring the deletion doesn’t hurt the read path. The Bloom filter serving gate is the key: it completely decouples the sweep pipeline’s write throughput from the story serving read latency, letting both scale independently.
Frequently Asked Questions
Q: Why not just use Cassandra’s built-in TTL for expiry? A: Cassandra TTL marks rows as tombstones and removes them during compaction, but it doesn’t emit application events, trigger S3 archival, update secondary indexes, or integrate with the Serving Gate. It’s a storage mechanism, not an application lifecycle hook. We use Cassandra TTL as a 7-day safety net, but the sweep pipeline is the primary expiry mechanism.
Q: Why not use database-level expiry (e.g., Redis TTL on story objects)? A: Story metadata is in Cassandra (not Redis) and the media is in S3. Redis TTL would only handle a caching layer, not the authoritative data. More importantly, we need to archive before deleting - a simple TTL delete loses the data entirely, which violates the “authors can retrieve archived stories” requirement.
Q: What happens if the sweep worker falls more than 1 hour behind? A: The sweep worker processes one minute-bucket at a time. If it falls behind, expired stories remain visible past their 24-hour mark. We alert on sweep lag > 5 minutes and scale workers horizontally. Since each shard is independent, adding workers increases throughput linearly. In practice, the worker fleet is sized for 3x peak load to handle traffic spikes.
Q: How do we handle the edge case where a user’s Story is in the archival pipeline when they delete their account? A: Account deletion triggers a separate tombstone record that the archival pipeline checks before completing an archive. If the account is deleted, the archival worker skips the cold copy and issues a direct S3 delete on the hot object instead. The tombstone check is a single Cassandra read per story, adding ~5ms latency to the archival path.
Q: Why Glacier Instant Retrieval instead of standard S3 for archives? A: Standard S3 costs ~$0.023/GB/month; Glacier IR costs ~$0.004/GB/month - an 83% reduction. Glacier IR provides millisecond retrieval latency, which is adequate for the archive browsing use case (3 second SLA). Standard Glacier (with 3-5 hour retrieval) would be even cheaper but would violate the SLA when authors browse their archives.
Q: How does the system handle Stories that expire while being actively viewed? A: The Serving Gate check happens at Story load time, not during streaming. A viewer who loads a Story 1 second before expiry can watch it to completion - we don’t terminate mid-stream. The gate only blocks new load requests after expiry. This is intentional: mid-stream termination creates a worse user experience than the 60-second post-expiry grace window already built into the SLA.
Interview Questions
Q: How would you design the expiry index to minimize Redis memory usage while maintaining sub-60-second expiry latency?
Expected depth: Discuss time-bucket granularity tradeoffs (1-minute vs 1-second buckets), sorted set vs plain set for bucket members, UUID compression (binary vs string), shard count vs key fan-out cost, and Redis cluster vs standalone for the 36GB total index size estimate.
Q: The archival pipeline needs to move 5 PB/day to cold storage. How do you handle S3 rate limiting?
Expected depth: Cover S3 per-prefix rate limits (3,500 PUT/DELETE per second per prefix), prefix sharding via user_id % N embedded in object keys, parallel consumer workers per shard, and the tradeoff between shard count and operational complexity. Mention S3 Transfer Acceleration for cross-region archival.
Q: How would you implement a “recently expired” recovery window where support can restore a story that was mistakenly reported as expired?
Expected depth: Discuss the 2-hour gap between expiry sweep and archival pipeline, maintaining a hot storage grace window, the admin API to transition archive_status back to active, Bloom filter invalidation challenges (you can’t remove from a Bloom filter - discuss filter rotation), and the audit log for expiry recoveries.
Q: A bug causes 50 million stories to be double-indexed in the expiry index (indexed twice). How does the system handle this?
Expected depth: Cassandra LWT on the expired flag update prevents double-processing, Kafka producer deduplication via idempotent producer config, the archival worker’s idempotency via ETag comparison before DELETE, and monitoring signals that would detect the anomaly (sweep worker throughput 2x normal, Kafka topic lag patterns).
Q: How would you add support for “Close Friends” stories where expiry behavior is identical but visibility is restricted to a subset of followers?
Expected depth: The expiry pipeline is visibility-agnostic - expiry events and archival work the same way. Close Friends restriction lives in the Serving Gate (authorization check before the Bloom filter check). Discuss storing the close-friends list as a denormalized set in the story metadata vs. a separate access control table, and the fan-out implications for the archival notification step.
Want to see how these patterns hold up when traffic spikes 50x at 3 AM? That's exactly what this Premium deep-dive covers.