The Migration That Broke Production

schema-migration database zero-downtime postgresql mysql

The Migration That Broke Production

ALTER TABLE on 400M rows. It locks. Everything downstream times out. You learn about online schema migration the hard way.

⏱ 12 min read📐 Intermediate🔒 Schema Migration

It’s 3:17 AM on a Wednesday. Jana, a senior backend engineer, has been asleep for exactly 47 minutes when her phone goes off. Every API endpoint on the dashboard is red - users.get, orders.create, auth.verify. All 504. The error rate has been climbing for 41 minutes. She drills into the slow query log and finds it immediately: a single ALTER TABLE accounts ADD COLUMN last_login_at TIMESTAMP NULL that has been running since 10:36 PM. It isn’t slow. It’s holding an exclusive metadata lock on the accounts table, and every query that touches that table has been queueing behind it ever since.

The migration was kicked off by a developer who added the column, pushed the deploy script, watched the first progress line, and went to bed. Routine operation: add a nullable column, no default. On a table with 10,000 rows this finishes in under a second. On a table with 400 million rows, MySQL is doing a full in-place table rebuild - copying every row into a new structure before swapping. By the time Jana is paged, the copy is 73% done. Killing it means starting over. Not killing it means 13 more minutes of total outage.

She kills it. The connection queue drains. The APIs start responding. She writes “DO NOT RUN THIS MIGRATION” in Slack and spends the next hour learning what she should have done instead.

Database schemas change constantly. New columns, dropped columns, renamed indexes, type promotions - no production system stays frozen. The difference between a migration that runs silently in the background and one that takes down the site for an hour is a matter of technique, not willpower. This is the online schema migration problem: how do you restructure a live table without stopping the world?

Why This Happens

Most databases implement DDL operations - CREATE TABLE, ALTER TABLE, DROP INDEX - as metadata-level changes that require exclusive ownership of the affected object. The specific behavior differs between MySQL and PostgreSQL, but the failure mode is identical at scale.

In MySQL’s InnoDB, many ALTER TABLE variants trigger what the engine calls an in-place rebuild: the original table is copied row by row into a new internal structure, then the old table is discarded. During this copy, the engine holds a metadata lock that blocks conflicting DML - inserts, updates, and deletes - from committing. It isn’t that your queries are being refused. They’re being queued. And those queued queries hold database connections. And the connection pool has a ceiling. When that ceiling is hit, new requests aren’t even queued - they fail immediately.

The failure chain looks like this:

ALTER TABLE accounts (400M rows)
  → exclusive metadata lock acquired by InnoDB
    → all DML against accounts tables into a wait queue
      → waiting queries each hold a connection slot
        → connection pool ceiling reached (100 slots)
          → new incoming requests cannot connect
            → APIs return 504 / connection timeout
              → total service outage

PostgreSQL handles DDL more gracefully in most cases - ALTER TABLE is transactional and short-lived for many operations. But operations that require a table rewrite (adding a column with a non-null default pre-PG11, changing a column’s type, rebuilding an index) still take an AccessExclusiveLock that blocks everything. The same failure chain applies.

The root cause isn’t a bug. Correctness guarantees require exclusivity. Exclusivity at the table level does not scale to hundreds of millions of rows.

The Naive Solution (and Where It Breaks)

The most common first response is the maintenance window: schedule the migration for 2 AM Sunday, put up an “under maintenance” page, run ALTER TABLE, wait. On a table under a million rows this is completely reasonable. It finishes in seconds and nobody notices.

At 400 million rows, the math breaks:

Small scale - 100K rows:
  ALTER TABLE → completes in ~0.3s → zero observable impact

Large scale - 400M rows:
  ALTER TABLE → runs 45-90 minutes depending on row width
  Every read and write to the table is blocked for the duration
  Maintenance window = 90 minutes of hard downtime
  Miss the window or miscalculate time? Back to the 3 AM page.

The second naive solution is to check the MySQL docs for operations that qualify as “instant” or “online.” MySQL 8.0 introduced a genuine ALGORITHM=INSTANT for some operations - adding a nullable column with no default being the most common. But this is engine-version-specific. A developer who assumes instant behavior on MySQL 5.7 will trigger a full rebuild anyway, silently, with no warning.

Naive ALTER TABLE failure cascade: metadata lock, queued connections, 504 outage

Neither approach works for a service that cannot accept 90 minutes of hard downtime. You need migrations that run invisibly in the background while traffic flows normally. Three tools and one pattern make that possible.

Online DDL - MySQL’s Built-In Attempt

MySQL 5.6 introduced online DDL: a set of ALTER TABLE algorithms that attempt schema changes without a full table lock. The syntax is explicit, which is its best feature:

ALTER TABLE accounts
  ADD COLUMN last_login_at TIMESTAMP NULL,
  ALGORITHM=INPLACE, LOCK=NONE;

ALGORITHM=INPLACE tells MySQL to attempt the change in-place. LOCK=NONE tells it to fail immediately if the operation requires a table lock - rather than silently taking one and blocking for an hour. If you don’t specify LOCK=NONE, MySQL will take whatever lock it needs without complaining.

When it works, online DDL is elegant. MySQL buffers concurrent DML in a row log during the rebuild, then applies the buffered changes when the rebuild finishes. Applications see no interruption. When it fails - because the operation requires a lock - you get a clear error, not a production incident.

The problem is coverage. Not all operations support ALGORITHM=INPLACE, LOCK=NONE. Adding a column with a non-null default, changing a column’s data type, and rebuilding certain index types still require a table lock. Online DDL covers the easy cases and gives engineers false confidence on the hard ones.

PostgreSQL 11 added a similar win: adding a nullable column with no default is now an instant catalog update - no row rewrite, no lock. But any column that needs to backfill data (ADD COLUMN created_at TIMESTAMP DEFAULT now() NOT NULL) still rewrites the table.

pt-online-schema-change - The Shadow Table Approach

pt-online-schema-change (pt-osc), from Percona’s toolkit, implements a shadow table pattern that sidesteps the lock entirely. Instead of modifying the table in place, it creates a copy with the new schema, syncs data between them incrementally, then atomically swaps the tables.

The process has four stages:

-- Stage 1: create shadow table with new schema
CREATE TABLE _accounts_new LIKE accounts;
ALTER TABLE _accounts_new ADD COLUMN last_login_at TIMESTAMP NULL;

-- Stage 2: install triggers to capture concurrent writes
CREATE TRIGGER pt_osc_ins AFTER INSERT ON accounts
  FOR EACH ROW REPLACE INTO _accounts_new VALUES (NEW.*);

CREATE TRIGGER pt_osc_upd AFTER UPDATE ON accounts
  FOR EACH ROW REPLACE INTO _accounts_new VALUES (NEW.*);

CREATE TRIGGER pt_osc_del AFTER DELETE ON accounts
  FOR EACH ROW DELETE FROM _accounts_new WHERE id = OLD.id;

-- Stage 3: copy rows in chunks (low priority, throttled)
INSERT LOW_PRIORITY IGNORE INTO _accounts_new
  SELECT * FROM accounts
  WHERE id BETWEEN :chunk_start AND :chunk_end;

-- Stage 4: validate row counts, then atomic swap
RENAME TABLE accounts TO _accounts_old,
             _accounts_new TO accounts;

pt-online-schema-change: shadow table creation, triggers, chunk copy, atomic RENAME

The genius is in the timing. The chunk copy runs as low-priority INSERT...SELECT statements that yield to normal traffic. The triggers ensure that any write arriving during the copy lands in both the original and shadow tables simultaneously. By the time the copy completes, the shadow table is fully current. The final RENAME takes a brief lock - measured in milliseconds - swaps the names, and releases.

The weakness is the triggers themselves. On a table with high write throughput, three triggers on every write adds overhead - typically 5-15% depending on row width and write volume. More critically, MySQL prior to 8.0 allows only one trigger per event per table, so if your table already has triggers, pt-osc is blocked entirely.

gh-ost - The Triggerless Migration

GitHub’s gh-ost (GitHub Online Schema Transmogrifier) was built specifically to solve the trigger problem, at the scale of GitHub’s largest tables. Instead of installing triggers on the original table, gh-ost reads from MySQL’s binary log - the same replication stream that read replicas consume.

# gh-ost connects as a pseudo-replica:
# - subscribes to binlog events from primary (or a replica)
# - creates _accounts_gho with new schema
# - applies each binlog row event to _accounts_gho in real time
# - simultaneously runs a background row copy

# Initiation (simplified):
gh-ost \
  --host=primary.db.internal \
  --table=accounts \
  --alter="ADD COLUMN last_login_at TIMESTAMP NULL" \
  --execute \
  --chunk-size=1000 \
  --max-lag-millis=1500 \
  --throttle-control-replicas=replica.db.internal

gh-ost: binlog-based migration, no triggers, parallel row copy and changelog apply

Reading from the binlog instead of triggers means gh-ost imposes zero per-row overhead on the primary, works on tables that already have triggers, and supports features that pt-osc cannot: real-time throttling via a Unix socket, pause and resume without losing progress, and a cut-over mechanism designed to avoid lock contention races.

The cut-over step uses a two-connection lock dance. One connection holds a LOCK TABLES on the original table. A second connection executes the RENAME. The first releases immediately after the rename completes. The window where both the lock and the rename are active - the window during which application writes are blocked - is typically 100-300 milliseconds. GitHub runs gh-ost on tables exceeding a billion rows, during business hours, on their primary database.

The trade-off is operational complexity. gh-ost requires binary logging to be enabled (binlog_format=ROW), which most production MySQL instances already have. It requires a replica for the throttle check. And the cut-over can be postponed indefinitely via the socket interface, which is useful for coordinating with deploys but requires a human to trigger the final swap.

The Expand-Contract Pattern - No Tools Required

The expand-contract pattern is an application-level approach that avoids external tooling entirely. It works by splitting one schema change into three sequential deploys:

Phase 1 - Expand:
  ALTER TABLE accounts ADD COLUMN login_ts TIMESTAMP NULL;
  # On MySQL 8+ / PG11+: this is instant - catalog change only
  # Application: reads old_col, writes old_col only

Phase 2 - Migrate (separate deploy):
  # Application: writes to BOTH old_col AND login_ts
  # Background job: backfills login_ts for existing rows
  # Reads: still from old_col - safe to roll back any time

Phase 3 - Contract (separate deploy):
  # Application: reads from login_ts, writes to login_ts only
  # Old column: no longer written - safe to drop
  # Rollback: switch reads back to old_col (no schema change needed)

Phase 4 - Cleanup (future deploy):
  ALTER TABLE accounts DROP COLUMN old_col;
  # Safe now - no code touches it

The key property is rollback safety at every phase. If Phase 2’s backfill job produces wrong data, you roll back the application code - the schema has only been expanded, never in a breaking state. This makes expand-contract the right choice for operations that neither pt-osc nor gh-ost can execute atomically: renaming a column, splitting one column into two, or changing a column’s semantic meaning alongside its type.

The cost is coordination overhead. Three deploys instead of one, a background job to monitor, and a period where application code must dual-write to both columns. For teams running gh-ost in production, expand-contract is a complement for the hard cases. For teams without migration tooling, it’s a safe default for everything.

The Full Architecture

Full online schema migration architecture: primary DB, replica, gh-ost, shadow table, atomic cut-over

The full picture during an online migration: application traffic flows to the primary database without interruption. gh-ost connects to a replica to read the binlog stream, minimizing load on the primary. A shadow table grows in the background, catching both the initial row copy and every concurrent write that arrives during the migration. A drift monitor watches copy progress and binlog lag - if lag exceeds the threshold, gh-ost throttles itself automatically. When the shadow table has caught up, the atomic cut-over executes in a sub-second lock window. From the application’s perspective, the table was never unavailable.

Component Deep Dives

Chunked Row Copy

Both pt-osc and gh-ost copy rows in small batches rather than a single INSERT INTO ... SELECT * FROM. The reason is contention. A single INSERT...SELECT across 400 million rows holds a read lock for its entire duration and generates a massive binary log event that delays replication. Chunks of 1,000-10,000 rows each produce small, bounded transactions that interleave cleanly with normal traffic.

-- gh-ost chunk copy (simplified):
INSERT IGNORE INTO _accounts_gho
  (id, email, created_at, last_login_at)
SELECT
  id, email, created_at, NULL
FROM accounts
WHERE id > :last_copied_id
  AND id <= :chunk_end
LOCK IN SHARE MODE;

-- Between each chunk, gh-ost checks replication lag:
-- if replica lag > max_lag_millis, sleep and retry

Throttle and Lag Control

gh-ost’s most operationally useful feature is its runtime control socket. While a migration is running:

# Check migration status:
echo "status" | nc -U /tmp/gh-ost.sock

# Throttle immediately (e.g., before a deploy):
echo "throttle" | nc -U /tmp/gh-ost.sock

# Resume:
echo "no-throttle" | nc -U /tmp/gh-ost.sock

# Set a new chunk size at runtime:
echo "chunk-size=500" | nc -U /tmp/gh-ost.sock

This turns a migration into a controllable background process rather than a fire-and-forget script.

Safe Backfill Jobs for Expand-Contract

When running a backfill in Phase 2 of expand-contract, the job must be idempotent and resumable:

def backfill_login_ts(batch_size=1000):
    last_id = get_checkpoint()  # stored in a separate control table
    while True:
        rows = db.query("""
            SELECT id, old_login_col
            FROM accounts
            WHERE id > %s
              AND login_ts IS NULL
            ORDER BY id
            LIMIT %s
        """, last_id, batch_size)

        if not rows:
            break

        db.execute("""
            UPDATE accounts
            SET login_ts = old_login_col
            WHERE id IN %s
              AND login_ts IS NULL
        """, [r.id for r in rows])

        last_id = rows[-1].id
        set_checkpoint(last_id)
        time.sleep(0.05)  # yield to production traffic

The AND login_ts IS NULL guard makes the update idempotent - rerunning the job after a failure won’t overwrite values that were already migrated.

Approach Comparison

Approach	Downtime	Write Overhead	Requires Tool	Handles Existing Triggers	Best Use Case
Naive `ALTER TABLE`	45-90 min	None	No	Yes	Dev/staging only - never production at scale
MySQL Online DDL	Near-zero (limited ops)	Low	No	Yes	Simple adds on MySQL 8+ / PG 11+ nullable columns
pt-online-schema-change	None (ms cut-over)	5-15% (triggers)	Yes (Percona)	No - blocked	High-write tables without pre-existing triggers
gh-ost	None (ms cut-over)	~0% on primary	Yes (GitHub)	Yes	Any table; preferred for 100M+ rows
Expand-Contract	None (3 deploys)	Dual-write during Phase 2	No	Yes	Column renames, type changes, no tooling available

Key Takeaways

Online DDL is the right first check on MySQL 8+ and PostgreSQL 11+ - a nullable column add is instant; anything that rewrites rows is not.
pt-online-schema-change works reliably for most cases but is blocked by pre-existing triggers; its chunk copy is gentler than a full rebuild.
gh-ost is the industry default for large-scale migrations - it reads the binlog instead of installing triggers, imposes near-zero primary overhead, and lets you pause mid-migration.
Shadow tables are the mechanism behind both pt-osc and gh-ost: create a copy with the new schema, sync incrementally, swap atomically.
The expand-contract pattern requires no tooling and is rollback-safe at every phase - it is the only approach that can rename a column without a cut-over window.
Chunk size and lag thresholds matter as much as tool choice - a migration that ignores replication lag will degrade your replica and delay backups.
The cut-over window - the brief lock taken at the very end - is typically under 300ms for gh-ost and under 10ms for pt-osc on a quiet table; this is the only true disruption.
Backfill jobs must be idempotent and resumable; a non-idempotent backfill that crashes at row 380M and restarts corrupts data already migrated.

Schemas are not immutable. The engineers who operate large databases long enough develop a healthy paranoia about DDL - not because the tools are bad, but because a mistake at scale is measured in lost revenue and 3 AM pages. The discipline isn’t in running the right tool; it’s in rehearsing the migration on a production-sized snapshot, measuring the lag impact, and having a tested abort procedure before you touch the primary.

Frequently Asked Questions

Q: Can I just add an index without downtime? A: In MySQL, use CREATE INDEX ... ALGORITHM=INPLACE, LOCK=NONE or let gh-ost handle it via --alter "ADD INDEX ...". In PostgreSQL, use CREATE INDEX CONCURRENTLY - it builds the index without locking the table, though it is slower and can fail if a conflicting lock appears during the build.

Q: How long does an online migration actually take on 400M rows? A: Expect 2-6 hours depending on row width, chunk size, replication lag threshold, and write throughput. gh-ost with a chunk size of 1,000 and a 1,500ms lag threshold on a table with moderate writes will typically process 100-200M rows per hour. Monitor rows_copied and estimated_rows via the status socket.

Q: What happens if gh-ost crashes mid-migration? A: The shadow table (_accounts_gho) is left in place with however many rows were copied. You can drop it and restart, or resume gh-ost - it will detect the existing shadow table and continue from where the chunk copy left off. No data is corrupted; the original table is untouched.

Q: Do I need a replica to run gh-ost? A: Not strictly. gh-ost can connect directly to the primary for binlog reads, but this is not recommended for large migrations - the replica connection allows gh-ost to throttle based on replica lag rather than primary lag, which is a more accurate measure of replication health. In practice, running against the primary is only acceptable on a single-node test setup.

Q: Can pt-osc and gh-ost handle foreign key constraints? A: Both handle it with caveats. pt-osc can recreate foreign keys on the shadow table before the rename. gh-ost avoids foreign keys on the shadow table during migration and relies on the application to enforce referential integrity - then recreates them post-swap. On tables with heavy FK constraints, the expand-contract pattern is often safer.

Q: Is there an equivalent to gh-ost for PostgreSQL? A: Yes: pg_repack rebuilds a table or index online using a trigger-based approach similar to pt-osc. pgactive and Citus’s pg_squeeze offer variations. PostgreSQL’s logical replication can also power a custom shadow-table approach for teams willing to invest in the tooling.

Interview Questions

Q: You’re asked to add a NOT NULL column with a default value to a 500M row PostgreSQL table with no downtime. Walk through your approach.

Expected depth: Start with the PostgreSQL version - PG 11+ supports adding a nullable column instantly (catalog change only), but NOT NULL with a default still rewrites pre-PG 11. Describe the expand-contract approach for pre-PG 11: add nullable first, backfill with a batched background job, add NOT NULL constraint using ADD CONSTRAINT ... NOT VALID then VALIDATE CONSTRAINT (two separate steps - validate acquires only a ShareUpdateExclusiveLock). Name the specific lock types and explain why NOT VALID + VALIDATE is safer than a single SET NOT NULL.

Q: Explain the difference between pt-online-schema-change and gh-ost, and when you would choose one over the other.

Expected depth: Cover the mechanism difference - pt-osc uses triggers (DML overhead, blocked by existing triggers), gh-ost reads the binlog (no primary overhead, works with existing triggers). Mention the operational differences: gh-ost’s runtime control socket, pause/resume, and the two-phase lock-free cut-over. Name when pt-osc is still a reasonable choice: smaller tables, environments where binlog access is restricted, or simple column additions where the trigger overhead is acceptable. Touch on PostgreSQL equivalents.

Q: What is the expand-contract pattern and what problem does it solve that online schema tools cannot?

Expected depth: Describe the three phases with concrete code. Explain why it is needed for column renames - a rename is a destructive schema change that no shadow-table tool can execute atomically without breaking in-flight queries. Name dual-write as the key mechanism during Phase 2 and explain the rollback story at each phase. Mention that expand-contract works at the application layer and requires no special database permissions or tooling.

Q: A developer just ran ALTER TABLE in production on a 300M row table and it’s been running for 30 minutes. What do you do?

Expected depth: Identify that killing it restarts the rebuild from scratch but may be necessary if the outage is unacceptable. Explain the connection queue drain that happens post-kill. Discuss how to prevent the same incident: gh-ost or pt-osc as the standard migration path, mandatory dry-run on a production-sized replica before primary, and a CI check that flags DDL statements in migration files against a known safe list. Mention ALGORITHM=INSTANT checks for MySQL 8+ before falling back to a tool.

Q: How do you test that a schema migration won’t cause problems before running it in production?

Expected depth: Describe running the migration on a production-sized clone - not staging with 10K rows, but a recent snapshot of the production database. Mention measuring binlog lag during the test run to predict production impact. Discuss load testing the cut-over: simulate production write throughput during the shadow copy and measure the lock window. Name pt-table-checksum or gh-ost’s built-in checksum for verifying row consistency between original and shadow tables. Mention the rehearsed abort: what is the exact command to kill the migration cleanly if something goes wrong at 11 PM.

This article is premium

One-time payment · Lifetime access to all premium content

Get Premium Access

Already have access? Sign in