Staff Prep 23: Postgres VACUUM & Table Bloat — Dead Tuples & Autovacuum Tuning

April 4, 2026 · 9 min read

VACUUM is the janitor of Postgres. Without it, dead tuples accumulate, table bloat grows, queries slow down, and eventually transaction ID wraparound threatens data loss. Most teams rely entirely on autovacuum with default settings that were tuned for the small databases Postgres served in 2003, not the 500GB tables of today.

Why dead tuples exist and what they cost

As covered in Part 21 (MVCC), every UPDATE creates a new row version and marks the old one as dead. DELETEs mark rows as dead. Dead rows stay in the heap until VACUUM reclaims the space.
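
A quick way to see this in action, sketched against a scratch database (the table name vacuum_demo is made up for illustration):

sql
-- Hypothetical scratch table to watch dead tuples accumulate
CREATE TABLE vacuum_demo (id int PRIMARY KEY, payload text);
INSERT INTO vacuum_demo SELECT g, 'x' FROM generate_series(1, 100000) g;

UPDATE vacuum_demo SET payload = 'y';  -- every row gets a new version; old versions become dead

SELECT n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'vacuum_demo';
-- expect n_dead_tup near 100000 (statistics are updated asynchronously)

VACUUM vacuum_demo;  -- dead tuple space becomes reusable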

The performance cost of dead tuples:

  • Sequential scans read dead rows before discarding them — more I/O than necessary.
  • Index scans find index entries pointing to dead rows, then fetch the heap page to confirm they are dead — wasted I/O.
  • Autovacuum cannot keep up on write-heavy tables → bloat spirals.
sql
-- Find tables with the most bloat
SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup,
    ROUND(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
    pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
    last_autovacuum,
    last_analyze
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY dead_pct DESC;

-- Estimate actual on-disk bloat (more accurate)
-- Uses pgstattuple extension
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- Note: pgstattuple scans the whole relation, so run it selectively
SELECT
    schemaname,
    tablename,
    s.dead_tuple_percent AS dead_pct,
    s.free_percent AS free_pct
FROM pg_tables,
     LATERAL pgstattuple(format('%I.%I', schemaname, tablename)::regclass) AS s
WHERE schemaname = 'public'
LIMIT 20;

Autovacuum: how it decides when to run

sql
-- Autovacuum triggers on a table when:
-- n_dead_tup >= autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * n_live_tup

-- Default settings:
-- autovacuum_vacuum_threshold = 50       (absolute minimum dead tuples)
-- autovacuum_vacuum_scale_factor = 0.2   (20% of live tuples)

-- For a 10M-row table:
-- threshold = 50 + 0.2 * 10,000,000 = 2,000,050 dead tuples before vacuum runs
-- That is 20% dead tuples = significant bloat before cleanup

-- Problem: for large tables, the default 20% is too high
-- Fix: override per-table for write-heavy large tables
ALTER TABLE orders SET (
    autovacuum_vacuum_scale_factor = 0.01,    -- vacuum at 1% dead tuples (100k rows)
    autovacuum_analyze_scale_factor = 0.005,  -- analyze at 0.5% changes
    autovacuum_vacuum_threshold = 1000,
    autovacuum_vacuum_cost_delay = 1          -- lower = faster; default is 2ms since PG12 (20ms before)
);

-- Check current autovacuum settings for a table
SELECT reloptions FROM pg_class WHERE relname = 'orders';
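
Per-table overrides handle hot spots, but the cluster-wide knobs deserve a look too. A sketch of settings commonly raised on write-heavy instances — the values below are illustrative, not a recommendation:

sql
-- Cluster-wide autovacuum throughput (values illustrative)
ALTER SYSTEM SET autovacuum_max_workers = 6;          -- default 3; requires a restart
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000; -- default -1 (inherits vacuum_cost_limit = 200)
ALTER SYSTEM SET autovacuum_naptime = '30s';          -- default 1min; reload is enough
SELECT pg_reload_conf();
-- Caveat: the cost limit is shared across all workers, so adding
-- workers alone does not increase total vacuum throughput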

Manual VACUUM: when to use it

sql
-- Regular VACUUM: marks dead tuple space reusable; it returns space to
-- the OS only when trailing pages at the end of the table are empty
VACUUM orders;

-- VACUUM ANALYZE: vacuum + update statistics (do both)
VACUUM ANALYZE orders;

-- VACUUM VERBOSE: see what it is doing
VACUUM VERBOSE orders;

-- VACUUM FULL: compacts the table, returns space to OS
-- WARNING: takes AccessExclusiveLock — blocks all reads and writes
-- Use pg_repack instead for online VACUUM FULL
VACUUM FULL orders;  -- DANGEROUS: use only during maintenance window

-- pg_repack: online full-table rewrite (only brief exclusive locks at start and end)
-- Install: apt install postgresql-16-repack
-- pg_repack --table orders mydb

-- FREEZE: prevent transaction ID wraparound
VACUUM FREEZE orders;  -- marks all rows as frozen (safe from wraparound)
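
Freezing matters because the oldest unfrozen XID must stay well clear of the ~2 billion wraparound limit; autovacuum forces an aggressive vacuum once a table passes autovacuum_freeze_max_age (200M by default). A quick check of how close you are:

sql
-- XID age per database; worrying as it approaches
-- autovacuum_freeze_max_age, critical near 2^31
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;

-- Same check per table, to find the laggards
SELECT relname, age(relfrozenxid) AS xid_age
FROM pg_class
WHERE relkind = 'r'
ORDER BY xid_age DESC
LIMIT 10;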

Index bloat: the forgotten sibling

sql
-- Indexes also accumulate dead entries
-- B-tree index files rarely shrink: VACUUM removes dead entries and recycles
-- whole pages for reuse, but the freed space stays inside the file

-- Check index size and usage (for a real bloat estimate, use pgstatindex
-- from the pgstattuple extension and look at avg_leaf_density)
SELECT
    indexrelid::regclass AS index_name,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 20;

-- REINDEX: rebuild an index from scratch
-- Blocking version (development only):
REINDEX INDEX idx_orders_user_id;

-- Online reindex (Postgres 12+):
REINDEX INDEX CONCURRENTLY idx_orders_user_id;

-- Reindex entire table (all indexes):
REINDEX TABLE CONCURRENTLY orders;
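
One caveat worth knowing: if REINDEX CONCURRENTLY (or CREATE INDEX CONCURRENTLY) fails or is cancelled, it can leave behind an invalid index that consumes space and write overhead but serves no queries. A quick check:

sql
-- Invalid indexes left by interrupted CONCURRENTLY operations
SELECT indexrelid::regclass AS index_name,
       indrelid::regclass  AS table_name
FROM pg_index
WHERE NOT indisvalid;
-- Drop any that appear (use DROP INDEX CONCURRENTLY to avoid blocking),
-- then rebuild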

Monitoring autovacuum activity

sql
-- See currently running autovacuum processes
SELECT pid, query, state, wait_event_type, wait_event,
       now() - pg_stat_activity.query_start AS duration
FROM pg_stat_activity
WHERE query LIKE 'autovacuum%';

-- See vacuum history per table
SELECT
    relname,
    last_vacuum, last_autovacuum,
    vacuum_count, autovacuum_count,
    last_analyze, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY last_autovacuum NULLS FIRST
LIMIT 20;

-- Tables that autovacuum has never touched:
SELECT relname FROM pg_stat_user_tables
WHERE last_autovacuum IS NULL AND last_vacuum IS NULL;

-- If autovacuum cannot keep up: check for long-running transactions
-- that hold back the oldest transaction ID (xmin horizon)
SELECT pid, now() - xact_start AS age, state, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 5;
-- Long-running transactions prevent VACUUM from reclaiming dead tuples
-- Kill them if they are stuck
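
Long-running transactions are not the only thing that can pin the xmin horizon. Abandoned replication slots and orphaned prepared transactions do the same, so check those too (a sketch; the pid and slot name are placeholders):

sql
-- Replication slots holding back the xmin horizon
SELECT slot_name, active, xmin, catalog_xmin
FROM pg_replication_slots;
-- An inactive slot with an old xmin pins dead tuples indefinitely; drop it:
-- SELECT pg_drop_replication_slot('slot_name');

-- Orphaned prepared transactions also block cleanup
SELECT gid, prepared, owner FROM pg_prepared_xacts;

-- Last resort for a genuinely stuck backend (hypothetical pid):
-- SELECT pg_terminate_backend(12345);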

Quiz: test your understanding

Before moving on, answer these in your head (or out loud):

  1. A table has 10M live rows and 3M dead rows. Using default autovacuum settings, at how many dead tuples will autovacuum trigger? Is 3M enough? What do you set to fix this?
  2. What is the difference between VACUUM and VACUUM FULL? Why is VACUUM FULL dangerous in production?
  3. After running VACUUM, a table still shows the same size on disk. Why? When does disk space actually get returned to the OS?
  4. A long-running analytics transaction has been running for 6 hours. You notice dead tuple counts growing rapidly on your main tables. What is the connection between the two?
  5. How do you rebuild a bloated index on a production table without taking downtime? What are the caveats?

Next up — Part 24: Redis Caching Patterns. Sorted sets, pub/sub, streams, TTL strategies and the data structures you are underusing.

← PREV
Staff Prep 22: Postgres Partitioning & Sharding — When and How