Staff Prep 21: Postgres MVCC — Multi-Version Concurrency Under the Hood
Back to Part 20: API Design at Scale. MVCC (Multi-Version Concurrency Control) is the most important concurrency mechanism in Postgres. It is why readers and writers do not block each other. It is also why VACUUM exists and why your tables bloat when you ignore autovacuum tuning. Most Senior engineers I interview can define MVCC. Few can tell me what xmin actually contains, and that gap is usually where the interesting questions live.
What MVCC is and why it exists
Traditional locking systems block reads when a write is in progress. Postgres sidesteps this through versioning. Instead of updating a row in place, it writes a new version and marks the old one as dead. Readers see whatever version was current when their transaction started, regardless of concurrent writes.
The net effect: reads never block writes, writes never block reads. That is the whole foundation of Postgres's concurrency story, and everything else (VACUUM, bloat, wraparound) is the bill you pay for it.
Xmin, xmax: the visibility system
-- Every row has hidden system columns:
-- xmin: transaction ID that created this row version
-- xmax: transaction ID that deleted/updated this row version (0 = still live)
-- ctid: physical location (block, offset) of this row version
SELECT xmin, xmax, ctid, id, name
FROM users WHERE id = 42;
-- Result:
-- xmin=1024, xmax=0, ctid=(5,3), id=42, name="Alice"
-- xmin=1024: created by transaction 1024
-- xmax=0: no transaction has deleted it (still live)
-- After an UPDATE:
UPDATE users SET name = 'Alicia' WHERE id = 42;
-- Old row: xmin=1024, xmax=1025, ctid=(5,3) ← marked dead by txn 1025
-- New row: xmin=1025, xmax=0, ctid=(7,1) ← new version at new physical location
-- Both rows exist in the heap simultaneously
-- A transaction with snapshot before 1025 sees the old row
-- A transaction with snapshot after 1025 commits sees the new row
Transaction snapshots
-- What does a transaction's "snapshot" contain?
-- xmin: lowest active transaction ID at snapshot time
-- xmax: highest transaction ID at snapshot time + 1
-- xip_list: list of active (in-progress) transaction IDs at snapshot time
-- Visibility rule for a row version (simplified):
-- A row is visible to my snapshot if:
-- row.xmin < my_snapshot.xmax (row was created before my snapshot)
-- AND row.xmin NOT IN my_snapshot.xip_list (creator was committed, not in-progress)
-- AND (row.xmax = 0 (row is not deleted)
-- OR row.xmax > my_snapshot.xmax (deleter started after my snapshot)
-- OR row.xmax IN my_snapshot.xip_list) (deleter not yet committed)
-- This is why READ COMMITTED gets a fresh snapshot per statement
-- REPEATABLE READ gets a snapshot once at transaction start and holds it
Dead tuples and their impact
-- Dead tuples accumulate from UPDATEs and DELETEs
-- Old row versions are kept until VACUUM reclaims them
-- Monitor dead tuple accumulation
SELECT
schemaname,
tablename,
n_live_tup,
n_dead_tup,
ROUND(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2) AS dead_pct,
last_vacuum,
last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC LIMIT 10;
-- A table with 30% dead tuples is reading 43% more data than necessary
-- Queries scan dead tuples before discarding them (unless heap_only_tuple optimization applies)
-- Force VACUUM when bloat is high
VACUUM ANALYZE orders;
-- VACUUM FULL: reclaims space and compacts, but takes AccessExclusiveLock
-- Use pg_repack instead (FULL without lock):
-- pg_repack --table orders mydb
Transaction ID wraparound: the catastrophic failure mode
-- Transaction IDs are 32-bit unsigned integers
-- Maximum: ~2 billion (2^32 / 2 = ~2.1 billion, with wraparound protection)
-- If VACUUM does not freeze old transactions, the DB will go read-only to prevent data loss
-- Check for wraparound risk
SELECT
datname,
age(datfrozenxid) AS age,
2000000000 - age(datfrozenxid) AS txns_until_wraparound
FROM pg_database
ORDER BY age(datfrozenxid) DESC;
-- If age > 1.5 billion: URGENT — run VACUUM FREEZE immediately
-- If age > 2 billion: Postgres will shut down with "database is not accepting commands to avoid wraparound"
-- Freeze old transactions proactively
VACUUM FREEZE users;
-- Configure autovacuum to freeze aggressively
ALTER TABLE high_volume_table SET (
autovacuum_freeze_max_age = 500000000 -- freeze at 500M (default: 200M)
);
HOT updates: MVCC optimisation
-- HOT = Heap-Only Tuple: an optimisation when:
-- 1. The updated columns are NOT indexed
-- 2. The new row version fits on the same page as the old one
-- With HOT: no index entries are updated (cheaper write, faster vacuum)
-- Without HOT: every index must be updated for every row update
-- Check HOT ratio
SELECT
relname,
n_tup_upd,
n_tup_hot_upd,
ROUND(100.0 * n_tup_hot_upd / NULLIF(n_tup_upd, 0), 1) AS hot_pct
FROM pg_stat_user_tables
ORDER BY n_tup_upd DESC;
-- Low HOT ratio (< 50%): check if you are updating indexed columns unnecessarily
-- Common fix: use separate columns for mutable and immutable data
-- Or increase fillfactor to leave room for HOT updates on the same page:
ALTER TABLE users SET (fillfactor = 70); -- leave 30% free space per page for HOT updates
Quiz: test your understanding
Before moving on, answer these in your head (or out loud):
- Why do readers never block writers in Postgres? What mechanism allows this?
- You run
UPDATE users SET name = 'Bob' WHERE id = 1. What physically happens in the heap? How many row versions exist after this command? - What are dead tuples? How do they accumulate? What happens if VACUUM never runs?
- Explain transaction ID wraparound in one paragraph. What is the failure mode if it is not managed?
- What is a HOT update and when does it apply? How do you increase your HOT ratio?
Next up: Part 22: Partitioning & Sharding. Range, list, hash partitioning, partition pruning, and when to actually shard.