Schema Migrations: Changing Your Database Without Breaking Production

You need to add a phone_number column to the users table. You write the migration: ALTER TABLE users ADD COLUMN phone_number VARCHAR(20). You run it in production. PostgreSQL locks the table. For 45 minutes. Every request that touches the users table times out. Your site is effectively down.

You just learned why schema migrations on large tables are one of the most dangerous operations in production engineering.

Why schema migrations are hard

In a small database, ALTER TABLE is instant. In a large database, it is not. The database must:

Acquire a lock on the table (blocking all reads and writes)
Rewrite every row to include the new column
Update all indexes
Release the lock

For a 400 million row table, step 2 takes minutes to hours. During that time, your application cannot read or write to the table.

This is the fundamental tension: your application needs the database to be available, but schema changes require exclusive access.

Online schema migration approaches

Approach 1: Expand-contract (backward-compatible migrations)

The safest approach for zero-downtime migrations. Split every schema change into multiple steps:

Expand phase: Add the new schema element in a backward-compatible way. The old code still works.

Add a nullable column (old code ignores it, new code writes to it)
Add a new table (old code does not use it)
Add a new index (does not affect reads or writes)

Migrate phase: Backfill existing data to populate the new schema element.

Contract phase: Remove the old schema element once all code uses the new one.

Drop the old column
Drop the old table
Drop the old index

This approach requires multiple deployments but each step is safe.

graph LR
subgraph expand["1. Expand - Add nullable column"]
  E1["ALTER TABLE users
ADD COLUMN phone VARCHAR(20)"]
  E2["Deploy new code
that writes phone on new users"]
end

subgraph migrate["2. Migrate - Backfill existing rows"]
  M1["UPDATE users SET phone = ''
WHERE phone IS NULL
LIMIT 1000 (batched)"]
end

subgraph contract["3. Contract - Make NOT NULL, drop old"]
  C1["ALTER TABLE users
ALTER COLUMN phone SET NOT NULL"]
  C2["Deploy code that
requires phone"]
end

expand --> migrate --> contract

style E1 fill:#E1F5EE,stroke:#0F6E56,color:#085041
style M1 fill:#FAEEDA,stroke:#854F0B,color:#633806
style C1 fill:#EEEDFE,stroke:#534AB7,color:#3C3489

Approach 2: Online DDL tools

Tools that perform schema changes without locking the table:

pt-online-schema-change (pt-osc) - Creates a new table with the new schema, copies data in batches, uses triggers to keep the new table in sync with writes to the old table, then atomically swaps the tables. Works with MySQL.

gh-ost (GitHub’s Online Schema Transmogrifier) - Similar to pt-osc but uses binary log replication instead of triggers. More reliable under high write load. Used by GitHub for all MySQL schema changes.

pglogical / pg_repack - PostgreSQL tools for online table rewrites. pg_repack rebuilds a table without holding a long lock.

Liquibase / Flyway - Migration management tools that track which migrations have been applied. They do not make migrations online by themselves, but they manage the migration lifecycle.

Approach 3: Database-native online DDL

Modern databases have improved their online DDL capabilities:

PostgreSQL - CREATE INDEX CONCURRENTLY builds an index without locking the table. ALTER TABLE ... ADD COLUMN with a default value is instant in PostgreSQL 11+ (it stores the default in the catalog, not in each row). ALTER TABLE ... ADD COLUMN NOT NULL DEFAULT still requires a table rewrite in older versions.

MySQL 8.0 - Many ALTER TABLE operations are online (ALGORITHM=INPLACE, LOCK=NONE). Adding a column, adding an index, changing column defaults - all online. Changing column type or adding a NOT NULL constraint still requires a table copy.

CockroachDB - Schema changes are always online. The database handles the migration internally using a multi-version schema approach.

The backfill problem

Adding a column is the easy part. Populating it for existing rows is the hard part.

A naive UPDATE users SET phone = '' WHERE phone IS NULL on 400 million rows:

Locks rows as it updates them
Generates enormous write-ahead log
Runs for hours
Blocks other queries

The right approach: batched backfill.

-- Run this in a loop until no rows remain
UPDATE users
SET phone = ''
WHERE id IN (
  SELECT id FROM users
  WHERE phone IS NULL
  LIMIT 1000
);

Process 1,000 rows at a time. Sleep 100ms between batches to give the database breathing room. The full migration takes longer but does not impact production.

Tools like strong_migrations (Rails) and django-zero-downtime-migrations enforce safe migration patterns automatically.

Where it breaks or gets interesting

The NOT NULL constraint trap

Adding a NOT NULL column without a default requires every existing row to have a value. The database must rewrite every row. On a large table, this is a long-running lock.

Safe approach: add the column as nullable first, backfill all rows, then add the NOT NULL constraint. In PostgreSQL 12+, adding a NOT NULL constraint with a CHECK constraint is instant if the check is already validated.

Renaming a column

You cannot rename a column without downtime if you do it in one step. The old code uses the old name; the new code uses the new name. During the deployment window, some servers run old code and some run new code.

Safe approach:

Add a new column with the new name
Write to both columns in the application
Backfill the new column from the old column
Switch reads to the new column
Stop writing to the old column
Drop the old column

This takes multiple deployments but is zero-downtime.

Index creation on large tables

CREATE INDEX on a large table takes a long time and (without CONCURRENTLY) locks the table. Always use CREATE INDEX CONCURRENTLY in PostgreSQL. The concurrent build takes longer but does not block reads or writes.

Caveat: CREATE INDEX CONCURRENTLY can fail if there are constraint violations. If it fails, it leaves an invalid index that must be dropped and recreated.

Foreign key constraints

Adding a foreign key constraint requires scanning the entire table to verify all existing rows satisfy the constraint. This is a long-running operation.

PostgreSQL approach: add the constraint as NOT VALID first (skips the scan), then validate it in a separate step with VALIDATE CONSTRAINT (which takes a weaker lock).

graph TB
subgraph dangerous["Dangerous Migrations"]
  D1["ADD COLUMN NOT NULL
without default"]
  D2["CREATE INDEX
without CONCURRENTLY"]
  D3["ADD FOREIGN KEY
without NOT VALID"]
  D4["RENAME COLUMN
in one step"]
  D5["DROP COLUMN
while old code uses it"]
end

subgraph safe["Safe Alternatives"]
  S1["ADD COLUMN nullable
then backfill
then NOT NULL"]
  S2["CREATE INDEX CONCURRENTLY"]
  S3["ADD FOREIGN KEY NOT VALID
then VALIDATE CONSTRAINT"]
  S4["Add new column
dual-write
switch reads
drop old"]
  S5["Deploy code without column
then DROP COLUMN"]
end

D1 --- S1
D2 --- S2
D3 --- S3
D4 --- S4
D5 --- S5

style D1 fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
style D2 fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
style D3 fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
style D4 fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
style D5 fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
style S1 fill:#E1F5EE,stroke:#0F6E56,color:#085041
style S2 fill:#E1F5EE,stroke:#0F6E56,color:#085041
style S3 fill:#E1F5EE,stroke:#0F6E56,color:#085041
style S4 fill:#E1F5EE,stroke:#0F6E56,color:#085041
style S5 fill:#E1F5EE,stroke:#0F6E56,color:#085041

Real-world systems

GitHub - Uses gh-ost for all MySQL schema changes. Runs migrations during low-traffic periods. Has a migration review process where all schema changes are reviewed before running.

Shopify - Uses strong_migrations gem to enforce safe migration patterns in Rails. Automatically rejects dangerous migrations and suggests safe alternatives.

Stripe - Documented their approach to zero-downtime migrations: expand-contract pattern, batched backfills, and careful ordering of deployment steps.

Facebook - Built OnlineSchemaChange (OSC), the predecessor to pt-osc, for MySQL schema changes at scale.

Airbnb - Uses a combination of expand-contract and gh-ost. Has a dedicated database reliability team that reviews all schema changes.

How to apply it in practice

The migration checklist

Before running any migration in production:

Test on a copy of production data - Run the migration on a database with the same size and data distribution as production. Measure how long it takes.
Check for table locks - Will this migration acquire a long-running lock? Use EXPLAIN or the database’s DDL documentation.
Plan the backfill - If you are adding a column, how will you populate existing rows? Batched update? Background job?
Check application compatibility - Can the old code run against the new schema? Can the new code run against the old schema? (Important for rolling deployments.)
Have a rollback plan - Can you revert the migration if something goes wrong? Some migrations (adding a column) are easy to roll back. Others (dropping a column) are not.
Schedule during low traffic - Run migrations during off-peak hours when possible.

Migration tooling

Flyway - Java-based, supports SQL and Java migrations, tracks migration history in a table
Liquibase - XML/YAML/JSON/SQL migrations, supports rollbacks, database-agnostic
Alembic - Python/SQLAlchemy migrations
Rails Active Record Migrations - Ruby on Rails built-in migration system
golang-migrate - Go migration tool supporting multiple databases

FAQ

Q: Should you run migrations before or after deploying new code?

It depends on the migration. For additive migrations (adding a column, adding a table): run the migration first, then deploy the code. The old code ignores the new column; the new code uses it. For destructive migrations (dropping a column, dropping a table): deploy the code first (which stops using the old column), then run the migration. This ensures no running code depends on what you are about to remove. This is the expand-contract pattern applied to deployment ordering.

Q: How do you handle migrations in a microservices architecture?

Each service owns its own database and manages its own migrations. Migrations run as part of the service’s deployment pipeline. The challenge: if service A and service B share a database (which they should not, but sometimes do), coordinating migrations is complex. The solution is to ensure each service has its own schema or database. Cross-service data access goes through APIs, not direct database access.

Q: What do you do if a migration fails halfway through?

Most migration tools wrap migrations in transactions. If the migration fails, the transaction rolls back and the database is in its original state. But some operations cannot be rolled back (DDL in MySQL is auto-committed, for example). For these, you need a manual rollback plan: a reverse migration that undoes the change. Always write the rollback migration before running the forward migration. Test the rollback in staging.

Interview questions

Q1: You need to add a NOT NULL column with no default to a 500 million row table in PostgreSQL. Walk through how you do this without downtime.

Strong answer: Use the expand-contract pattern. Step 1: add the column as nullable with no constraints. This is instant in PostgreSQL. Step 2: deploy new application code that writes a value to the new column for all new rows. Step 3: backfill existing rows in batches of 10,000 rows with a sleep between batches to avoid overwhelming the database. This runs as a background job over hours or days. Step 4: once all rows have a value, add a CHECK constraint with NOT VALID to avoid scanning existing rows. Step 5: validate the constraint with VALIDATE CONSTRAINT which takes a ShareUpdateExclusiveLock (does not block reads or writes). Step 6: add the NOT NULL constraint. In PostgreSQL 12+, if the CHECK constraint is already validated, adding NOT NULL is instant. Total downtime: zero.

Q2: Your team wants to rename a column from user_name to username in a high-traffic table. How do you do this safely?

Strong answer: You cannot rename in one step because old and new code will run simultaneously during the deployment. The safe approach: add a new column username, deploy code that writes to both user_name and username on every write, backfill username from user_name for existing rows, deploy code that reads from username (still writes to both), verify no reads use user_name, deploy code that only writes to username, drop user_name. This takes 4-5 deployments over days or weeks but is completely zero-downtime. The dual-write phase ensures both columns are always in sync during the transition.

Q3: How does gh-ost work and why is it safer than pt-online-schema-change for high-write tables?

Strong answer: Both tools create a shadow table with the new schema and copy data from the original table. The difference is how they keep the shadow table in sync with ongoing writes. pt-osc uses triggers: it adds INSERT, UPDATE, DELETE triggers to the original table that mirror changes to the shadow table. Under high write load, triggers add overhead to every write and can cause replication lag. gh-ost uses binary log replication: it connects to MySQL as a replica and reads the binlog to apply changes to the shadow table. This has much lower overhead on the primary because there are no triggers. It also gives gh-ost more control: it can pause the migration if replication lag increases, throttle based on load, and resume without losing progress. The final cutover is a brief lock (typically under 1 second) to swap the tables atomically.