How a NOT NULL Column Migration Locked Our Users Table for 14 Minutes
10:03 AM, Monday. Every board on the dashboard goes red at the same instant. API, web, mobile backend, all of it. PagerDuty starts stacking notifications faster than I can dismiss them.
Two minutes earlier we'd shipped a schema migration. One of those migrations I'd specifically filed under "routine" in my head on the way to the review, which in hindsight is the universal signal you're about to have a bad morning.
The migration was one line. Adding a column with a safe default value. What followed was 14
minutes of a full exclusive lock on our 2.4-million-row users table, and a lesson
about Postgres internals I now mentally replay every time I review an ALTER TABLE.
Production failure
We were running a multi-tenant SaaS. The users table had 2.4 million rows across
all tenants. Not enormous by warehouse standards, but a core operational table hit on nearly
every authenticated API request. Our deploy pipeline ran database migrations first, before
rolling out new application code. That ordering plus a trivial-looking column addition took
down everything.
The migration added an onboarding_source column to track how users had signed up:
organic, referral, or sales-assisted demo. The column needed a default for existing rows.
-- This migration ran fine in staging (50,000 rows).
-- In production (2.4M rows), it acquired an exclusive lock for 14 minutes.
ALTER TABLE users
ADD COLUMN onboarding_source VARCHAR(64) NOT NULL DEFAULT 'organic';
In staging, with 50,000 rows, this migration ran in 1.2 seconds. In production, with 2.4 million rows, it ran for 14 minutes and held an exclusive lock on the entire table the whole way. No reads, no writes, nothing.
False assumptions
Our first instinct was to blame the application deploy, not the migration. New code had shipped simultaneously. We rolled back the application tier; new pods swapped in within three minutes. The outage kept going. Whatever was wrong wasn't in the app layer.
Second theory: connection pool exhaustion cascade. We'd seen this pattern before. A slow operation holds connections, the pool fills, new connections queue, latency spikes, system seizes. PgBouncer metrics showed 400 connections waiting. That matched the symptom but wasn't the cause. Something was holding those connections, not the other way around.
It took us seven minutes to actually check what was running inside Postgres. Seven minutes in which I was also trying to explain to the CEO over Slack what "probably an application bug" meant.
SELECT pid, now() - query_start AS duration, state, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC;
-- pid | duration | state | query
-- -------+-------------+---------+------------------------------------------
-- 48291 | 00:03:42 | active | ALTER TABLE users ADD COLUMN onboarding_source...
-- 48302 | 00:03:42 | active | SELECT id, email FROM users WHERE... (waiting)
-- 48317 | 00:03:41 | active | UPDATE users SET last_login_at = ... (waiting)
-- 48331 | 00:03:41 | active | SELECT * FROM users WHERE tenant_id = ... (waiting)
-- ... (394 more waiting queries)
There it was. PID 48291 running for 3 minutes 42 seconds. The ALTER TABLE had an
exclusive lock on users. Every other query touching the table (SELECTs, UPDATEs,
all of it) was queued behind it: 397 queries piled up and waiting.
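In hindsight, Postgres can name the blocker directly instead of making you eyeball durations. A query along these lines, using pg_blocking_pids() (available since Postgres 9.6), would have cut those seven minutes down considerably:

-- For every waiting backend, list the PIDs currently blocking it
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;

Every row of that output would have pointed straight at PID 48291.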
Root cause: what Postgres does with NOT NULL + DEFAULT
The root cause was a misunderstanding of how Postgres handles schema changes internally. It's the kind of thing that's easy to get wrong even after years of working with Postgres, and I say that as someone who thought he knew.
Before Postgres 11, adding any column with a DEFAULT value required a full
table rewrite. Postgres scanned every existing row and physically wrote the new column's value
into each one. For 2.4 million rows of variable-width VARCHAR data, that meant
re-serializing 2.4 million tuples on disk while holding an ACCESS EXCLUSIVE lock
that blocks every read and write to the table.
Postgres 11 introduced the optimisation I thought protected me. Adding a column with a
non-volatile DEFAULT (a constant, not NOW() or a function) no longer
requires a rewrite. Postgres stores the default in the catalog and serves it virtually for old
rows, only writing the value on future updates. We were on Postgres 14, I knew about this
optimisation, and I still walked into the lock.
The culprit was the NOT NULL constraint.
Precise behaviour in Postgres 14: adding a column with DEFAULT alone is fast. Adding
NOT NULL DEFAULT forces Postgres to validate that no existing row has a NULL in that
column (requires a full scan), and for certain column types and storage combinations it triggers
the legacy full-rewrite path. Our VARCHAR(64) NOT NULL DEFAULT hit the slow path.
Postgres rewrote the entire table.
POSTGRES ALTER TABLE — FAST PATH vs. SLOW PATH
─────────────────────────────────────────────────────────────────────
ADD COLUMN col VARCHAR(64) DEFAULT 'organic' ← Postgres 11+: FAST ✓
───────────────────────────────────────────────
No table rewrite. Default stored in pg_attrdef.
Old rows serve the default value virtually.
Lock duration: milliseconds.
─────────────────────────────────────────────────────────────────────
ADD COLUMN col VARCHAR(64) NOT NULL DEFAULT 'organic' ← SLOW PATH ✗
──────────────────────────────────────────────────────
Postgres must validate NOT NULL across all existing rows.
Triggers full table rewrite to physically embed the default.
Acquires ACCESS EXCLUSIVE lock for entire duration.
ACCESS EXCLUSIVE lock blocks:
✗ SELECT ✗ INSERT ✗ UPDATE
✗ DELETE ✗ VACUUM ✗ any concurrent access
2.4M rows × rewrite overhead = 14 minutes of total outage.
─────────────────────────────────────────────────────────────────────
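You can also check after the fact which path Postgres took. Since Postgres 11, a fast-path column addition records the default in the catalog rather than in the rows; the pg_attribute columns atthasmissing and attmissingval show whether old rows are being served a stored value virtually:

-- atthasmissing = true means the column was added without a table rewrite;
-- old rows get attmissingval filled in at read time
SELECT attname, atthasmissing, attmissingval
FROM pg_attribute
WHERE attrelid = 'users'::regclass
  AND attname = 'onboarding_source';

For a freshly added column with a default, atthasmissing = false is a strong hint that the values were written physically, i.e. the rewrite path.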
The irony is a bit on the nose: we added NOT NULL to enforce data integrity, and
that exact constraint turned a 1-second migration into a 14-minute outage.
Staging never caught this because 50,000 rows rewrote in 1.2 seconds, fast enough that nobody noticed the lock was held. The behaviour was identical. The scale was not.
Architecture fix: the safe zero-downtime migration pattern
The fix was rethinking how we write any schema migration that touches a large, live table. The pattern is to decompose a single dangerous operation into three safe steps across separate releases.
SAFE MIGRATION: 3-STEP ZERO-DOWNTIME PATTERN
─────────────────────────────────────────────────────────────────────
Step 1 — Deploy migration: ADD COLUMN nullable, no constraint
─────────────────────────────────────────────────────────────
ALTER TABLE users ADD COLUMN onboarding_source VARCHAR(64);
-- Fast path. No rewrite. Milliseconds.
-- Existing rows: NULL (allowed for now)
-- New rows: application code writes the value explicitly
─────────────────────────────────────────────────────────────────────
Step 2 — Backfill existing rows in batches (background job)
─────────────────────────────────────────────────────────────
UPDATE users
SET onboarding_source = 'organic'
WHERE onboarding_source IS NULL
AND id BETWEEN ${batchStart} AND ${batchEnd};
-- No table lock. Small batches. Runs offline at low priority.
-- Takes minutes, not seconds — but table stays fully live.
─────────────────────────────────────────────────────────────────────
Step 3 — Deploy migration: add NOT NULL + default after backfill
─────────────────────────────────────────────────────────────────
ALTER TABLE users
ALTER COLUMN onboarding_source SET DEFAULT 'organic',
ALTER COLUMN onboarding_source SET NOT NULL;
-- Postgres 12+ can validate NOT NULL from an existing validated
--   CHECK (onboarding_source IS NOT NULL) constraint, skipping the scan
-- Only safe once zero NULLs remain in the column
─────────────────────────────────────────────────────────────────────
This spreads the work across three deploys and a background job. The table is never locked for more than a few milliseconds at any point. Users see nothing.
The backfill script processes rows in chunks of 10,000 with a 50ms pause between batches to avoid overwhelming the primary.
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const BATCH_SIZE = 10_000;
const SLEEP_MS = 50;

async function sleep(ms: number) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

async function backfill() {
  const { rows } = await pool.query('SELECT MAX(id) AS max FROM users');
  const maxId: number = rows[0].max ?? 0; // MAX(id) is NULL on an empty table
  let batchStart = 1;
  let totalUpdated = 0;

  while (batchStart <= maxId) {
    const batchEnd = batchStart + BATCH_SIZE - 1;
    const result = await pool.query(
      `UPDATE users
       SET onboarding_source = 'organic'
       WHERE onboarding_source IS NULL
         AND id BETWEEN $1 AND $2`,
      [batchStart, batchEnd]
    );
    totalUpdated += result.rowCount ?? 0;
    console.log(`Backfilled up to id=${batchEnd} | total=${totalUpdated}`);
    batchStart = batchEnd + 1;
    await sleep(SLEEP_MS); // Yield time to avoid replica lag buildup
  }

  console.log(`Backfill complete. ${totalUpdated} rows updated.`);
  await pool.end();
}

backfill().catch(console.error);
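One thing worth pulling out of the script: the id-range arithmetic is the easiest place to introduce an off-by-one, because BETWEEN is inclusive on both ends. A pure helper like this hypothetical batchRanges (not part of our actual script) makes that arithmetic trivially unit-testable:

```typescript
// Hypothetical helper: inclusive [start, end] id ranges covering 1..maxId.
// BETWEEN is inclusive on both ends, so consecutive ranges must not overlap.
function batchRanges(maxId: number, batchSize: number): Array<[number, number]> {
  const ranges: Array<[number, number]> = [];
  for (let start = 1; start <= maxId; start += batchSize) {
    // Clamp the last range so it never runs past maxId
    ranges.push([start, Math.min(start + batchSize - 1, maxId)]);
  }
  return ranges;
}
```

Consecutive ranges share no ids, so no row is updated twice and none is skipped.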
The backfill took about 18 minutes across 2.4 million rows. Longer than the original migration
would have taken if it had worked. The difference is the users table stayed fully
accessible the entire time. We ran it at 2 AM on a low-traffic night, verified zero NULL rows,
and deployed Step 3. The final NOT NULL constraint addition completed in under 200
milliseconds.
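The "verified zero NULL rows" step is worth making explicit, because Step 3 will fail if any rows slipped through. Something like:

-- Must return 0 before Step 3 is safe to deploy
SELECT COUNT(*) FROM users WHERE onboarding_source IS NULL;

Running it immediately before deploying Step 3 also catches any rows written by stale application code between the backfill and the constraint.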
What we changed after this
The incident led to three durable changes in how we manage schemas.
- A migration review checklist enforced in CI. Every migration touching a table with more than 100,000 rows must pass a review gate. The checklist flags any ADD COLUMN ... NOT NULL, ADD CONSTRAINT, column type change, and index creation without CONCURRENTLY. PRs with flagged patterns require sign-off from a second engineer who has read the checklist.
- Production row counts in the migration review context. A GitHub Actions step fetches live table sizes from a read replica and posts them as a PR comment alongside any migration file changes. Reviewers no longer have to remember that the users table is 10x larger in production than in staging.
- A lock_timeout on all migration runs. We now set lock_timeout = '5s' and statement_timeout = '30s' as session-level settings before any production migration. If a migration cannot acquire its lock within 5 seconds, it fails fast instead of queueing behind live traffic. A failed migration is infinitely better than a silent 14-minute lock.
-- Applied before every production migration
SET lock_timeout = '5s'; -- Fail fast if we can't get the lock
SET statement_timeout = '30s'; -- Kill runaway migrations before they spiral
SET idle_in_transaction_session_timeout = '10s';
-- The actual migration runs here
\i 20260315_add_onboarding_source_step1.sql
RESET lock_timeout;
RESET statement_timeout;
RESET idle_in_transaction_session_timeout;
Lessons learned
- NOT NULL + DEFAULT is not the same as DEFAULT alone. Postgres 11+ optimised the nullable case; the NOT NULL case is not fully optimised. Decompose: add nullable, backfill, then constrain.
- Staging row counts are fiction. A migration that takes 1 second on 50,000 rows will not take 48 seconds on 2.4 million rows; it may hit an entirely different code path with entirely different I/O characteristics. Benchmark migrations against a production-sized snapshot.
- Add indexes CONCURRENTLY, always. A plain CREATE INDEX blocks writes to the table for the full build duration. There is no operational reason to create an index non-concurrently on a live table.
- lock_timeout is your circuit breaker. If your migration cannot acquire its lock immediately, it should abort cleanly rather than queue behind 400 live connections and make the outage worse.
- Run migrations separately from code deployments. We now treat schema migrations as independent deploys with their own approval step. A migration problem should never take down application code that was working fine before.
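The CONCURRENTLY point deserves its own snippet, since it is the same lock lesson in index form. A sketch, with a hypothetical index name:

-- Builds the index without blocking writes.
-- Caveats: cannot run inside a transaction block, takes longer than a
-- plain CREATE INDEX, and leaves an INVALID index behind if it fails
-- (drop it and retry).
CREATE INDEX CONCURRENTLY idx_users_onboarding_source
    ON users (onboarding_source);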
The outage lasted 14 minutes. The incident review lasted two hours. The checklist and CI changes
took a day to implement. We've shipped dozens of schema migrations since, including several on
tables larger than users, and none of them have required so much as a maintenance
window. Adding a constraint to 2.4 million rows of existing data is not a schema change. It's a
data migration. We treat it like one now.