Zero-Downtime Database Migrations: A Field Guide for Production Systems
ALTER TABLE locks your database. Your migration takes longer than expected. Users get errors. Here's how to handle schema changes that don't interrupt production traffic.
Anurag Verma
Every team hits this at some point: you need to rename a column, add a NOT NULL constraint, or change a data type on a table that receives thousands of writes per minute. You run the migration. It acquires a lock. Traffic starts timing out. On-call wakes up. The migration is still running.
The pain is avoidable. Zero-downtime schema changes require more steps, but each individual step is fast and safe. The overall process takes longer than a blocking migration but doesn’t affect production availability.
Why ALTER TABLE Blocks
When PostgreSQL runs most ALTER TABLE operations, it acquires an AccessExclusiveLock on the table. This lock blocks reads and writes until the operation completes. For small tables, that’s milliseconds. For tables with millions of rows or heavy write traffic, it can be seconds or minutes — long enough to cascade into application timeouts and queued requests.
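If you suspect a migration is stuck behind (or is blocking) other sessions, you can check from another connection. A quick diagnostic using the built-in pg_blocking_pids() function (available since PostgreSQL 9.6):
-- Sessions that are blocked, and the PIDs blocking them
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;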
Not all operations are equally expensive:
- Fast (usually instant): adding a nullable column with no default, dropping a column, renaming a table or column, and (on PostgreSQL 11+) adding a column with a constant default
- Slow (full table rewrite): adding a column with a volatile default (or any default before PostgreSQL 11), changing a column type, adding a STORED generated column
- Index operations: standard CREATE INDEX blocks writes; CREATE INDEX CONCURRENTLY does not
The goal with zero-downtime migrations is to never run a slow, blocking operation against a live table.
The Expand-Contract Pattern
Most zero-downtime schema changes follow a pattern called expand-contract. The idea is to spread the change across multiple deployments so that each individual step is non-blocking.
Phase 1: Expand
Add the new structure without removing the old one. The application continues to use the old structure; the new structure exists but is unused.
-- Adding a new column, nullable, no default
-- Fast in PostgreSQL — no table rewrite
ALTER TABLE orders ADD COLUMN customer_name TEXT;
If you need the column to have a default, set it separately:
-- Fast: adds the default definition only
ALTER TABLE orders ALTER COLUMN customer_name SET DEFAULT '';
-- Backfill existing rows in batches to avoid long-running transactions
UPDATE orders
SET customer_name = ''
WHERE customer_name IS NULL
AND id BETWEEN 1 AND 10000;
-- Repeat for each batch until complete
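Running those batches by hand gets tedious, so teams usually script the loop. A minimal sketch using psycopg2 (the connection string, table, and batch size are placeholders for your environment):
import time
import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder connection string
conn.autocommit = True  # each batch commits on its own, keeping transactions short

BATCH_SIZE = 10_000
with conn.cursor() as cur:
    while True:
        # Update one batch of rows that still need the backfill
        cur.execute(
            """
            UPDATE orders
            SET customer_name = ''
            WHERE id IN (
                SELECT id FROM orders
                WHERE customer_name IS NULL
                LIMIT %s
            )
            """,
            (BATCH_SIZE,),
        )
        if cur.rowcount == 0:
            break  # nothing left to backfill
        time.sleep(0.1)  # give vacuum and replicas room to keep up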
Phase 2: Migrate
Deploy application code that writes to both the old and new structures simultaneously. Then backfill rows created before the dual writes began.
# Order, User, and db are the application's SQLAlchemy models and session
def create_order(user_id: int, total: float, user: User):
    order = Order(
        user_id=user_id,
        total=total,
        legacy_customer_id=user.id,    # old column — still written
        customer_name=user.full_name,  # new column — written from now on
    )
    db.session.add(order)
    db.session.commit()
For reads during this phase, prefer the new column but fall back to the old:
def get_customer_display_name(order: Order) -> str:
    return order.customer_name or get_legacy_name(order.legacy_customer_id)
Phase 3: Verify
Before contracting (removing the old structure), confirm the migration is complete:
-- No nulls should remain after backfill
SELECT COUNT(*) FROM orders WHERE customer_name IS NULL;
-- New column should have data that looks right
SELECT customer_name FROM orders ORDER BY RANDOM() LIMIT 20;
Only proceed when you’re confident the new structure has correct data for all rows.
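If the backfill copies values from another table, a stronger check is to compare the two sources directly. Assuming customer_name is backfilled from users.full_name, as in the Alembic example later in this piece:
-- Should return 0 once old and new values agree
SELECT COUNT(*)
FROM orders o
JOIN users u ON u.id = o.user_id
WHERE o.customer_name IS DISTINCT FROM u.full_name;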
Phase 4: Contract
Remove the old structure. Since the application no longer uses it, this is safe:
-- Dropping a column is fast and doesn't rewrite the table
ALTER TABLE orders DROP COLUMN legacy_customer_id;
Then deploy application code that reads and writes only the new column, removing the dual-write logic.
Adding NOT NULL Constraints Safely
This is one of the most common migration mistakes: adding NOT NULL to a column using the naive approach, which locks the entire table.
The wrong way:
-- Acquires AccessExclusiveLock and scans every row while holding it
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;
The right way (PostgreSQL 12+):
Use a check constraint with NOT VALID first, then validate:
-- Fast: adds constraint definition without scanning existing rows
ALTER TABLE orders
ADD CONSTRAINT orders_status_not_null
CHECK (status IS NOT NULL) NOT VALID;
-- VALIDATE acquires ShareUpdateExclusiveLock — reads and writes continue
-- Takes time proportional to table size, but doesn't block traffic
ALTER TABLE orders VALIDATE CONSTRAINT orders_status_not_null;
-- Once validated, PostgreSQL knows all rows are valid
-- This ALTER is now instant
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;
-- Clean up the check constraint
ALTER TABLE orders DROP CONSTRAINT orders_status_not_null;
The VALIDATE CONSTRAINT step is slow, but it only takes a ShareUpdateExclusiveLock — reads and writes proceed normally while it runs.
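You can also confirm in the catalog that the validation stuck before running the final SET NOT NULL:
-- convalidated should be true after VALIDATE CONSTRAINT succeeds
SELECT conname, convalidated
FROM pg_constraint
WHERE conrelid = 'orders'::regclass
  AND conname = 'orders_status_not_null';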
Creating Indexes Without Locking
Standard CREATE INDEX blocks writes on the table during the build. For large tables, use CONCURRENTLY:
-- Standard: blocks all writes during index build
CREATE INDEX idx_orders_user_id ON orders(user_id);
-- Concurrent: doesn't block writes, but takes longer
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders(user_id);
CREATE INDEX CONCURRENTLY takes two passes over the table and validates the index at the end. It can fail if a concurrent transaction conflicts with it, leaving an INVALID index behind. Always check for invalid indexes after a concurrent build:
SELECT n.nspname AS schemaname,
       t.relname AS tablename,
       c.relname AS indexname
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
JOIN pg_class t ON t.oid = i.indrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE NOT i.indisvalid;
If you find one, drop it and rebuild:
DROP INDEX CONCURRENTLY idx_orders_user_id;
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders(user_id);
Renaming Columns
Column renames seem simple but break applications that reference the old name. The expand-contract approach (the SQL for steps 1 and 5 is sketched after the list):
- Add a new column with the target name
- Write to both columns in application code
- Backfill the new column from the old
- Switch reads to the new column
- Drop the old column
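Concretely, the schema-level steps (1 and 5) look like this; steps 2–4 live in application code and a batched backfill, exactly as in the phases above:
-- Step 1: add the new column (fast, no rewrite)
ALTER TABLE orders ADD COLUMN customer_id INTEGER;

-- Step 5: once nothing references the old name, drop it (also fast)
ALTER TABLE orders DROP COLUMN user_id;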
In PostgreSQL, if views or generated columns reference the old column, update them after switching reads; PostgreSQL will refuse to drop a column that other objects still depend on.
Some teams use GENERATED ALWAYS AS to keep both names valid temporarily:
-- Mirror the old column under a new name during transition
ALTER TABLE orders ADD COLUMN customer_id INTEGER GENERATED ALWAYS AS (user_id) STORED;
This only works when the new name is a direct alias for the old column's value — not for renames where the data needs transformation. Note also that adding a STORED generated column forces a full table rewrite (as noted earlier), so this shortcut only suits tables small enough to tolerate one.
Alembic: Separate Expand and Contract
With Alembic, create separate migration versions for each phase:
# Standard Alembic migration imports, present in each version file
from alembic import op
import sqlalchemy as sa

# Version 1: Expand — add new column
def upgrade():
    op.add_column('orders', sa.Column('customer_name', sa.Text(), nullable=True))

# Version 2: Backfill (data migration, not schema migration)
def upgrade():
    # Single-statement backfill; for very large tables, batch this instead
    # to avoid one long-running transaction
    op.execute("""
        UPDATE orders
        SET customer_name = u.full_name
        FROM users u
        WHERE u.id = orders.user_id
          AND orders.customer_name IS NULL
    """)

# Version 3: Contract — drop old column after application is updated
def upgrade():
    op.drop_column('orders', 'legacy_customer_id')
Never combine expand and contract into a single migration version when they need to be deployed across separate application releases. The whole point is that the application code and schema change in coordinated, reversible steps.
PgBouncer Transaction Mode
If you run PgBouncer in transaction pooling mode (common for high-connection environments), be aware that CREATE INDEX CONCURRENTLY and ALTER TABLE ... VALIDATE CONSTRAINT require a session-level connection. They will fail or behave unexpectedly in transaction mode.
Run zero-downtime migrations through a direct PostgreSQL connection, not through PgBouncer in transaction mode. Temporarily switch to session pooling or connect directly for the duration of the migration.
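A minimal sketch of the direct-connection approach with psycopg2; host, port, and credentials are placeholders, and only the migration bypasses the pooler:
import psycopg2

# Application traffic keeps using the PgBouncer endpoint; this connects
# straight to PostgreSQL (placeholder host/port/user)
conn = psycopg2.connect(host="db-host", port=5432, dbname="app", user="migrator")
conn.autocommit = True  # CREATE INDEX CONCURRENTLY refuses to run inside a transaction block
with conn.cursor() as cur:
    cur.execute("CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders (user_id)")
conn.close()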
A Checklist Before Every Migration
Before running any migration in production:
- Check table size: SELECT pg_size_pretty(pg_total_relation_size('orders')). Large tables need more care.
- Check current locks: SELECT * FROM pg_locks WHERE NOT granted to see if anything is already waiting.
- Set a lock timeout: SET lock_timeout = '3s' before the migration statement. If it can't acquire a lock in 3 seconds, it fails fast rather than queuing and building a lock chain.
- Test on production data volume: migrations that take 2 seconds on a dev database may take 20 minutes on the real table.
- Have a rollback plan: can you revert this migration if something goes wrong? DROP COLUMN is fast; removing a NOT NULL constraint is also fast.
Setting lock_timeout is underused. Without it, a migration that can’t immediately acquire a lock queues behind active transactions, and new transactions queue behind the migration — turning a 1-second schema change into a traffic-stopping pile-up.
BEGIN;
SET LOCAL lock_timeout = '3s';  -- LOCAL scopes the setting to this transaction only
ALTER TABLE orders ...;
COMMIT;
If the migration times out, you see an error immediately and no locks were held. Retry when traffic is lighter or the blocking transaction has ended.
Zero-downtime migrations take more steps than blocking ones, but the failure mode of a blocking migration in production is bad enough that the extra work is worth it every time. The expand-contract pattern in particular is worth learning once and applying by default — it removes the category of “migration that took down the site” entirely.