Zero-Downtime Database Migrations: A Field Guide for Production Systems
ALTER TABLE locks your database. Your migration takes longer than expected. Users get errors. Here's how to handle schema changes that don't interrupt production traffic.
Anurag Verma
Every team hits this at some point: you need to rename a column, add a NOT NULL constraint, or change a data type on a table that receives thousands of writes per minute. You run the migration. It acquires a lock. Traffic starts timing out. On-call wakes up. The migration is still running.
The pain is avoidable. Zero-downtime schema changes require more steps, but each individual step is fast and safe. The overall process takes longer than a blocking migration but doesn’t affect production availability.
Why ALTER TABLE Blocks
When PostgreSQL runs most ALTER TABLE operations, it acquires an AccessExclusiveLock on the table. This lock blocks reads and writes until the operation completes. For small tables, that’s milliseconds. For tables with millions of rows or heavy write traffic, it can be seconds or minutes — long enough to cascade into application timeouts and queued requests.
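If you suspect a migration is stuck behind (or is blocking) other sessions, you can check from another connection. A quick diagnostic using the built-in pg_blocking_pids() function (available since PostgreSQL 9.6):
-- Sessions that are blocked, and the PIDs blocking them
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;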
Not all operations are equally expensive:
- Fast (usually instant): adding a nullable column with no default, dropping a column, renaming a table or column, and (on PostgreSQL 11+) adding a column with a constant default
- Slow (full table rewrite): adding a column with a volatile default (or any default before PostgreSQL 11), changing a column type, adding a STORED generated column
- Index operations: standard CREATE INDEX blocks writes; CREATE INDEX CONCURRENTLY does not
The goal with zero-downtime migrations is to never run a slow, blocking operation against a live table.
The Expand-Contract Pattern
Most zero-downtime schema changes follow a pattern called expand-contract. The idea is to spread the change across multiple deployments so that each individual step is non-blocking.
Phase 1: Expand
Add the new structure without removing the old one. The application continues to use the old structure; the new structure exists but is unused.
-- Adding a new column, nullable, no default
-- Fast in PostgreSQL — no table rewrite
ALTER TABLE orders ADD COLUMN customer_name TEXT;
If you need the column to have a default, set it separately:
-- Fast: adds the default definition only
ALTER TABLE orders ALTER COLUMN customer_name SET DEFAULT '';
-- Backfill existing rows in batches to avoid long-running transactions
UPDATE orders
SET customer_name = ''
WHERE customer_name IS NULL
AND id BETWEEN 1 AND 10000;
-- Repeat for each batch until complete
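Running those batches by hand gets tedious, so teams usually script the loop. A minimal sketch using psycopg2 (the connection string, table, and batch size are placeholders for your environment):
import time
import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder connection string
conn.autocommit = True  # each batch commits on its own, keeping transactions short

BATCH_SIZE = 10_000
with conn.cursor() as cur:
    while True:
        # Update one batch of rows that still need the backfill
        cur.execute(
            """
            UPDATE orders
            SET customer_name = ''
            WHERE id IN (
                SELECT id FROM orders
                WHERE customer_name IS NULL
                LIMIT %s
            )
            """,
            (BATCH_SIZE,),
        )
        if cur.rowcount == 0:
            break  # nothing left to backfill
        time.sleep(0.1)  # give vacuum and replicas room to keep up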
Phase 2: Migrate
Deploy application code that writes to both the old and new structures simultaneously. Then backfill rows created before the dual writes began.
# Order, User, and db are the application's SQLAlchemy models and session
def create_order(user_id: int, total: float, user: User):
    order = Order(
        user_id=user_id,
        total=total,
        legacy_customer_id=user.id,    # old column — still written
        customer_name=user.full_name,  # new column — written from now on
    )
    db.session.add(order)
    db.session.commit()
For reads during this phase, prefer the new column but fall back to the old:
def get_customer_display_name(order: Order) -> str:
    return order.customer_name or get_legacy_name(order.legacy_customer_id)
Phase 3: Verify
Before contracting (removing the old structure), confirm the migration is complete:
-- No nulls should remain after backfill
SELECT COUNT(*) FROM orders WHERE customer_name IS NULL;
-- New column should have data that looks right
SELECT customer_name FROM orders ORDER BY RANDOM() LIMIT 20;
Only proceed when you’re confident the new structure has correct data for all rows.
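If the backfill copies values from another table, a stronger check is to compare the two sources directly. Assuming customer_name is backfilled from users.full_name, as in the Alembic example later in this piece:
-- Should return 0 once old and new values agree
SELECT COUNT(*)
FROM orders o
JOIN users u ON u.id = o.user_id
WHERE o.customer_name IS DISTINCT FROM u.full_name;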
Phase 4: Contract
Remove the old structure. Since the application no longer uses it, this is safe:
-- Dropping a column is fast and doesn't rewrite the table
ALTER TABLE orders DROP COLUMN legacy_customer_id;
Then deploy application code that reads and writes only the new column, removing the dual-write logic.
Adding NOT NULL Constraints Safely
This is one of the most common migration mistakes: adding NOT NULL to a column using the naive approach, which locks the entire table.
The wrong way:
-- Acquires AccessExclusiveLock and scans every row while holding it
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;
The right way (PostgreSQL 12+):
Use a check constraint with NOT VALID first, then validate:
-- Fast: adds constraint definition without scanning existing rows
ALTER TABLE orders
ADD CONSTRAINT orders_status_not_null
CHECK (status IS NOT NULL) NOT VALID;
-- VALIDATE acquires ShareUpdateExclusiveLock — reads and writes continue
-- Takes time proportional to table size, but doesn't block traffic
ALTER TABLE orders VALIDATE CONSTRAINT orders_status_not_null;
-- Once validated, PostgreSQL knows all rows are valid
-- This ALTER is now instant
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;
-- Clean up the check constraint
ALTER TABLE orders DROP CONSTRAINT orders_status_not_null;
The VALIDATE CONSTRAINT step is slow, but it only takes a ShareUpdateExclusiveLock — reads and writes proceed normally while it runs.
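You can also confirm in the catalog that the validation stuck before running the final SET NOT NULL:
-- convalidated should be true after VALIDATE CONSTRAINT succeeds
SELECT conname, convalidated
FROM pg_constraint
WHERE conrelid = 'orders'::regclass
  AND conname = 'orders_status_not_null';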
Creating Indexes Without Locking
Standard CREATE INDEX blocks writes on the table during the build. For large tables, use CONCURRENTLY:
-- Standard: blocks all writes during index build
CREATE INDEX idx_orders_user_id ON orders(user_id);
-- Concurrent: doesn't block writes, but takes longer
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders(user_id);
CREATE INDEX CONCURRENTLY takes two passes over the table and validates the index at the end. It can fail if a concurrent transaction conflicts with it, leaving an INVALID index behind. Always check for invalid indexes after a concurrent build:
SELECT n.nspname AS schemaname,
       t.relname AS tablename,
       c.relname AS indexname
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
JOIN pg_class t ON t.oid = i.indrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE NOT i.indisvalid;
If you find one, drop it and rebuild:
DROP INDEX CONCURRENTLY idx_orders_user_id;
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders(user_id);
Renaming Columns
Column renames seem simple but break applications that reference the old name. The expand-contract approach (the SQL for steps 1 and 5 is sketched after the list):
- Add a new column with the target name
- Write to both columns in application code
- Backfill the new column from the old
- Switch reads to the new column
- Drop the old column
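Concretely, the schema-level steps (1 and 5) look like this; steps 2–4 live in application code and a batched backfill, exactly as in the phases above:
-- Step 1: add the new column (fast, no rewrite)
ALTER TABLE orders ADD COLUMN customer_id INTEGER;

-- Step 5: once nothing references the old name, drop it (also fast)
ALTER TABLE orders DROP COLUMN user_id;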
In PostgreSQL, if views or generated columns reference the old column, update them after switching reads; PostgreSQL will refuse to drop a column that other objects still depend on.
Some teams use GENERATED ALWAYS AS to keep both names valid temporarily:
-- Mirror the old column under a new name during transition
ALTER TABLE orders ADD COLUMN customer_id INTEGER GENERATED ALWAYS AS (user_id) STORED;
This only works when the new name is a direct alias for the old column's value — not for renames where the data needs transformation. Note also that adding a STORED generated column forces a full table rewrite (as noted earlier), so this shortcut only suits tables small enough to tolerate one.
Alembic: Separate Expand and Contract
With Alembic, create separate migration versions for each phase:
# Standard Alembic migration imports, present in each version file
from alembic import op
import sqlalchemy as sa

# Version 1: Expand — add new column
def upgrade():
    op.add_column('orders', sa.Column('customer_name', sa.Text(), nullable=True))

# Version 2: Backfill (data migration, not schema migration)
def upgrade():
    # Single-statement backfill; for very large tables, batch this instead
    # to avoid one long-running transaction
    op.execute("""
        UPDATE orders
        SET customer_name = u.full_name
        FROM users u
        WHERE u.id = orders.user_id
          AND orders.customer_name IS NULL
    """)

# Version 3: Contract — drop old column after application is updated
def upgrade():
    op.drop_column('orders', 'legacy_customer_id')
Never combine expand and contract into a single migration version when they need to be deployed across separate application releases. The whole point is that the application code and schema change in coordinated, reversible steps.
PgBouncer Transaction Mode
If you run PgBouncer in transaction pooling mode (common for high-connection environments), be aware that CREATE INDEX CONCURRENTLY and ALTER TABLE ... VALIDATE CONSTRAINT require a session-level connection. They will fail or behave unexpectedly in transaction mode.
Run zero-downtime migrations through a direct PostgreSQL connection, not through PgBouncer in transaction mode. Temporarily switch to session pooling or connect directly for the duration of the migration.
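A minimal sketch of the direct-connection approach with psycopg2; host, port, and credentials are placeholders, and only the migration bypasses the pooler:
import psycopg2

# Application traffic keeps using the PgBouncer endpoint; this connects
# straight to PostgreSQL (placeholder host/port/user)
conn = psycopg2.connect(host="db-host", port=5432, dbname="app", user="migrator")
conn.autocommit = True  # CREATE INDEX CONCURRENTLY refuses to run inside a transaction block
with conn.cursor() as cur:
    cur.execute("CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders (user_id)")
conn.close()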
A Checklist Before Every Migration
Before running any migration in production:
- Check table size: SELECT pg_size_pretty(pg_total_relation_size('orders')). Large tables need more care.
- Check current locks: SELECT * FROM pg_locks WHERE NOT granted to see if anything is already waiting.
- Set a lock timeout: SET lock_timeout = '3s' before the migration statement. If it can't acquire a lock in 3 seconds, it fails fast rather than queuing and building a lock chain.
- Test on production data volume: migrations that take 2 seconds on a dev database may take 20 minutes on the real table.
- Have a rollback plan: can you revert this migration if something goes wrong? DROP COLUMN is fast; removing a NOT NULL constraint is also fast.
Setting lock_timeout is underused. Without it, a migration that can’t immediately acquire a lock queues behind active transactions, and new transactions queue behind the migration — turning a 1-second schema change into a traffic-stopping pile-up.
BEGIN;
SET LOCAL lock_timeout = '3s';  -- LOCAL scopes the setting to this transaction only
ALTER TABLE orders ...;
COMMIT;
If the migration times out, you see an error immediately and no locks were held. Retry when traffic is lighter or the blocking transaction has ended.
Zero-downtime migrations take more steps than blocking ones, but the failure mode of a blocking migration in production is bad enough that the extra work is worth it every time. The expand-contract pattern in particular is worth learning once and applying by default — it removes the category of “migration that took down the site” entirely.