Postgres Internals 108: EXPLAIN & Query Planning
In Part 7 we covered how Postgres
manages concurrent access through MVCC and locking. Now for the performance side: why is that query slow?
Postgres has a built-in tool that answers this question with surgical precision — EXPLAIN ANALYZE.
This is Part 8: how the query planner works, how to read a query plan, and the patterns that reliably
turn slow queries fast.
What the query planner does
When you run a query, Postgres doesn't execute it immediately. First, it hands the parsed query to the query planner. The planner's job: find the cheapest way to execute the query.
"Cheapest" means estimated total I/O and CPU cost. The planner generates multiple candidate plans
(different join orders, different scan strategies), estimates the cost of each using column statistics
stored in pg_statistic, and picks the plan with the lowest estimated cost.
SQL Query
↓
Parser (syntax tree)
↓
Rewriter (apply rules, views)
↓
Planner (generate + cost candidate plans, pick winner)
↓
Executor (run the plan)
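To make "cost" concrete: for a sequential scan, the estimate is roughly pages × seq_page_cost + rows × cpu_tuple_cost (plus per-row costs for any filter). The numbers below are illustrative, but the cost parameters are real settings you can inspect with SHOW seq_page_cost;

-- 10,000-page, 500,000-row table, default seq_page_cost = 1.0, cpu_tuple_cost = 0.01
-- cost ≈ 10000 × 1.0 + 500000 × 0.01 = 15000

Index scans are additionally charged random_page_cost (default 4.0) per heap page they might touch, which is why they lose to sequential scans once many rows match.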
The planner is not magic. It estimates based on statistics, and statistics can be wrong. A bad estimate leads to a bad plan — this is the root cause of most "why is this query suddenly slow" incidents.
EXPLAIN vs EXPLAIN ANALYZE
EXPLAIN shows you the plan the planner chose — estimated costs and row counts,
but it does not run the query.
EXPLAIN ANALYZE actually executes the query and shows both estimated and actual numbers.
This is what you use when debugging: the gap between estimated and actual rows is where bad plans hide.
-- Shows plan only — safe to run on production queries without side effects
EXPLAIN SELECT * FROM orders WHERE user_id = 42;
-- Executes the query AND shows actual timing and row counts
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 42;
-- Most useful flags for debugging
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders WHERE user_id = 42;
BUFFERS shows cache hits vs disk reads — critical for understanding whether slowness
is I/O-bound or CPU-bound. If you see high Buffers: shared read numbers, your working
set doesn't fit in shared_buffers and you're hitting disk.
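Here's an illustrative example of the extra line BUFFERS adds to each node (numbers invented for illustration):

Seq Scan on orders (cost=0.00..18340.00 rows=450000 width=128)
  (actual time=0.015..180.322 rows=450000 loops=1)
  Buffers: shared hit=2048 read=6292

shared hit counts pages found in shared_buffers; shared read counts pages fetched from the OS (disk or OS page cache). A consistently high read count on a hot query means the working set doesn't fit in memory.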
Warning: EXPLAIN ANALYZE runs the query. For DELETEs or UPDATEs,
wrap in a transaction and roll back:
BEGIN;
EXPLAIN ANALYZE DELETE FROM logs WHERE created_at < '2025-01-01';
ROLLBACK;
Reading a query plan
A query plan is a tree. Each node is an operation. The innermost nodes run first; results flow upward to the root. Each node shows:
Node Type (cost=startup..total rows=estimated_rows width=row_bytes)
(actual time=startup..total rows=actual_rows loops=N)
Seq scan
A full table scan. Reads every row, filters as it goes. Fast when you need most of the table. Slow and alarming when you expected an index scan.
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 42;
-- Output:
Seq Scan on orders (cost=0.00..18340.00 rows=1 width=128)
(actual time=2.453..234.231 rows=1 loops=1)
Filter: (user_id = 42)
Rows Removed by Filter: 450000
450,000 rows scanned to return 1. There's no index on user_id, or the planner
decided the index wasn't worth using (see: low selectivity).
Index scan
Uses a B-tree index to find matching rows, then fetches them from the heap. Fast for selective queries. Each heap fetch is potentially a random I/O — every matched row can live on a different page — which is why the planner abandons index scans when many rows match.
-- With an index on user_id:
Index Scan using orders_user_id_idx on orders (cost=0.43..8.45 rows=1 width=128)
(actual time=0.042..0.043 rows=1 loops=1)
Index Cond: (user_id = 42)
Index only scan
All columns needed are in the index — no heap access required. The fastest scan type. Requires every column in SELECT and WHERE to be in the index, and depends on the visibility map: rows on pages not yet marked all-visible still need a heap check (reported as Heap Fetches), so a freshly modified table won't show Heap Fetches: 0 until VACUUM runs.
-- If index is on (user_id) and we only SELECT user_id:
Index Only Scan using orders_user_id_idx on orders
(cost=0.43..4.45 rows=1 width=8)
Heap Fetches: 0 ← no heap I/O
Bitmap heap scan
Two-phase scan. First, a Bitmap Index Scan collects all matching TIDs (row pointers) from the index into a bitmap. Then the Bitmap Heap Scan reads heap pages in physical order (reducing random I/O). Appears when a query matches many rows — more efficient than individual Index Scans at medium selectivity.
Bitmap Heap Scan on orders (cost=12.34..180.23 rows=450 width=128)
Recheck Cond: (status = 'pending')
-> Bitmap Index Scan on orders_status_idx
Index Cond: (status = 'pending')
Hash join
For joining two tables. Builds a hash table from the smaller relation, then probes it with the
larger. Fast when one side fits in work_mem. If it spills to disk, you'll see
"Batches: N" with N > 1 — increase work_mem for that session.
Hash Join (cost=1234.56..5678.90 rows=450 width=256)
Hash Cond: (orders.user_id = users.id)
-> Seq Scan on orders
-> Hash
Buckets: 1024 Batches: 1 Memory Usage: 48kB ← fits in memory
-> Seq Scan on users
Nested loop
For each row in the outer relation, scan the inner relation. Efficient when the outer set is small and the inner has an index. Terrible when both sides are large — O(n×m) complexity.
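A typical Nested Loop shape (illustrative numbers): the inner Index Scan runs once per outer row, so its loops count equals the outer row count.

Nested Loop (cost=0.43..512.30 rows=150 width=256)
  -> Seq Scan on users (actual time=0.010..1.200 rows=50 loops=1)
  -> Index Scan using orders_user_id_idx on orders
       (actual time=0.020..0.030 rows=3 loops=50)
       Index Cond: (user_id = users.id)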
Merge join
Requires both sides to be sorted on the join key. Works well when both sides are already sorted (index scans in order) or when sorting them is cheap. Uses a linear merge — O(n+m).
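An illustrative Merge Join plan — one side comes pre-sorted from an index, the other needs an explicit Sort:

Merge Join (cost=1500.00..3200.00 rows=45000 width=256)
  Merge Cond: (orders.user_id = users.id)
  -> Index Scan using orders_user_id_idx on orders
  -> Sort
       Sort Key: users.id
       -> Seq Scan on users

An explicit Sort feeding a Merge Join is a hint that an index on the sort key could eliminate the sort entirely.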
The most important numbers
Rows: estimated vs actual
This is the single most diagnostic number in a query plan. When estimated rows differ wildly from actual rows, the planner made a bad decision based on bad statistics.
-- Estimated: 1 row. Actual: 45,000 rows.
-- The planner expected a tiny result and chose a Nested Loop.
-- With 45,000 rows, a Hash Join would have been 100x faster.
Nested Loop (cost=0.43..890.12 rows=1 width=256)
(actual time=0.05..8234.12 rows=45000 loops=1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ← catastrophic estimate miss
Actual time
Format: actual time=startup_ms..total_ms. startup is time to return
the first row; total is time to return all rows. The root node's total time is your
query's wall clock time.
Loops
In a Nested Loop, inner nodes execute once per outer row. A node showing loops=450
ran 450 times. The actual time shown is per loop — multiply by loops for total time. A slow inner
node in a loop is magnified.
-- This looks fast — 2ms
Index Scan ... (actual time=0.01..2.00 rows=10 loops=450)
-- But it ran 450 times — total: 900ms just for this node
-- Total rows: 450 × 10 = 4,500
pg_stats and ANALYZE
The planner's estimates come from statistics exposed in the pg_stats view (stored in pg_statistic). Postgres collects them by sampling each table — 300 × default_statistics_target rows. The target defaults to 100 and also caps how many most-common values and histogram buckets are kept per column.
-- See what statistics Postgres has for a column
SELECT
attname,
n_distinct,
correlation,
most_common_vals,
most_common_freqs
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';
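Illustrative output for a skewed status column, and how to read it:

 attname | n_distinct | correlation | most_common_vals                      | most_common_freqs
---------+------------+-------------+---------------------------------------+----------------------
 status  |          4 |        0.21 | {completed,pending,shipped,cancelled} | {0.80,0.12,0.05,0.03}

n_distinct is the estimated number of distinct values (negative means a fraction of the row count, e.g. -1 for a unique column). correlation measures how closely physical row order follows the column's sort order — it feeds the index-vs-bitmap-scan decision. most_common_vals/most_common_freqs are what tell the planner that status = 'completed' matches about 80% of the table.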
ANALYZE refreshes these statistics. Autovacuum runs it automatically, but if you just
loaded a million rows or your data distribution changed significantly, run it manually:
ANALYZE orders; -- update stats for the whole table
ANALYZE orders (user_id); -- update stats for one column only
For skewed data — columns where a few values appear very frequently — increase the statistics target for that column to give the planner more detail:
-- Raise the statistics target for the status column (default is 100)
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000;
ANALYZE orders;
Common bad plans and how to fix them
Sequential scan on an indexed column
The planner chose a Seq Scan despite an index existing. Two common reasons:
1. Low selectivity. If status = 'active' matches 80% of the table,
using the index is slower than a sequential scan. The planner is correct to ignore it. Solution:
partial indexes on the minority value.
-- Index only the rare status values
CREATE INDEX orders_status_pending_idx ON orders (status)
WHERE status IN ('pending', 'processing');
2. Stale statistics. The planner thinks the column has low selectivity because
stats haven't been updated since a data shape change. Run ANALYZE.
Implicit type cast killing index use
This trap is subtler than it looks. A bare quoted literal like '42' is an untyped constant, so WHERE user_id = '42' still resolves to an integer comparison and uses the index. The index breaks when the other side has a concrete non-matching type — a numeric literal, or a driver binding the wrong parameter type — because Postgres casts the column, and a cast column no longer matches the index expression.
-- user_id is INT; comparing against a numeric casts the column
SELECT * FROM orders WHERE user_id = 42.0; -- ❌ implicit cast on column
-- Correct: pass the right type
SELECT * FROM orders WHERE user_id = 42; -- ✅ index used
-- Or: check your ORM isn't binding mismatched types for integer columns
-- In psycopg2: cursor.execute("... WHERE user_id = %s", [42]) -- 42 as int, not Decimal or str
Bad join order
The planner joins tables in the order it estimates produces the smallest intermediate result first. If statistics are wrong, it might join a large table first, creating a huge intermediate. You can force join order as a diagnostic:
SET join_collapse_limit = 1; -- disables join reordering, uses query order
EXPLAIN ANALYZE
SELECT * FROM orders o
JOIN users u ON u.id = o.user_id
JOIN products p ON p.id = o.product_id
WHERE o.status = 'pending';
Debugging with enable_seqscan = off
When you're trying to understand why Postgres isn't using an index you expect it to, turn off sequential scans for the session. This forces the planner to use indexes — helpful for seeing what plan it would generate if it did:
SET enable_seqscan = off;
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 42;
-- Now see what the index scan plan costs vs what the seq scan would have been
SET enable_seqscan = on; -- always restore
Never leave this set in production. Use it only as a diagnostic in a session.
Finding slow queries in production
pg_stat_statements
The most important extension for production Postgres. It tracks execution statistics for every query pattern across the whole database.
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Top 10 queries by total time
SELECT
query,
calls,
total_exec_time / 1000 AS total_sec,
mean_exec_time AS mean_ms,
rows,
stddev_exec_time AS stddev_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
-- Queries with worst average latency (min 100 calls)
SELECT
query,
calls,
mean_exec_time AS mean_ms,
max_exec_time AS max_ms
FROM pg_stat_statements
WHERE calls > 100
ORDER BY mean_exec_time DESC
LIMIT 10;
The output normalizes query parameters — WHERE id = $1 — so all calls to the same
query pattern are aggregated, regardless of the parameter value. This is what makes it useful
for spotting the one query that runs 50,000 times a day with a 40ms average.
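For example, these two executions:

SELECT * FROM orders WHERE user_id = 42;
SELECT * FROM orders WHERE user_id = 97;

both count toward a single pg_stat_statements row whose query column reads:

SELECT * FROM orders WHERE user_id = $1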
auto_explain for logging slow query plans
auto_explain logs the full EXPLAIN ANALYZE output for any query that exceeds a time
threshold — without any application changes. Invaluable for catching slow queries in production
without manually adding EXPLAIN everywhere.
-- Enable auto_explain (in postgresql.conf or per-session)
LOAD 'auto_explain';
SET auto_explain.log_min_duration = 1000; -- log plans for queries > 1 second
SET auto_explain.log_analyze = true;
SET auto_explain.log_buffers = true;
-- The next slow query will write its full plan to the Postgres log
-- Check: tail -f /var/log/postgresql/postgresql.log
To enable globally without a restart:
ALTER SYSTEM SET session_preload_libraries = 'auto_explain';
ALTER SYSTEM SET auto_explain.log_min_duration = '1s';
SELECT pg_reload_conf();
Quiz
Q1. EXPLAIN shows rows=1 estimated but EXPLAIN ANALYZE shows rows=45000 actual. What's the most likely cause and fix?
Stale or insufficient statistics. The planner sampled the table when the column had a very different distribution — perhaps when the table had far fewer rows, or the column's value distribution changed. With only 1 row expected, the planner probably chose a Nested Loop join or an Index Scan that performs poorly at 45,000 rows.
Fix: Run ANALYZE tablename to refresh statistics. If the distribution is highly skewed (a few values are very common), raise the statistics target for that column: ALTER TABLE t ALTER COLUMN c SET STATISTICS 500; ANALYZE t;
Q2. You add an index on email VARCHAR but queries with WHERE LOWER(email) = 'user@example.com' still do Seq Scans. Why?
The index is on email, not on LOWER(email). When you apply a function to an indexed column in the WHERE clause, Postgres can't use the standard index — it would need to apply LOWER() to every indexed value to compare, defeating the index.
Fix: Create an expression index: CREATE INDEX email_lower_idx ON users (LOWER(email)); Now queries with WHERE LOWER(email) = ... will use this index directly.
Q3. A Hash Join node shows Batches: 8. What does this mean and how do you fix it?
The hash table spilled to disk. When the hash table doesn't fit in work_mem, Postgres partitions both sides of the join into batches and processes them one at a time, writing all but the current batch to temporary files. Batches: 8 means the join data was split into 8 partitions, written out and read back in — much slower than a single in-memory pass.
Fix: Increase work_mem for the session running the query: SET work_mem = '256MB'; EXPLAIN ANALYZE .... Check if Batches drops to 1. Don't increase it globally unless you're sure about memory impact — work_mem applies per sort/hash operation, not per connection, and one complex query can use it multiple times.
Q4. What's the difference between pg_stat_statements and auto_explain? When do you use each?
pg_stat_statements: Aggregates query statistics over time. You use it for identifying which queries are consuming the most resources — total time, call count, average latency. It doesn't show you the query plan, just the metrics. Use it first to find the culprits.
auto_explain: Logs the full EXPLAIN ANALYZE plan for individual query executions that exceed a time threshold. You use it after you've identified a slow query pattern to see exactly what plan it ran and where the time was spent. The output is one plan per execution, written to the server log.
Typical workflow: pg_stat_statements to find slow query patterns → auto_explain (or manual EXPLAIN ANALYZE) to diagnose the plan → fix the index/statistics/query → verify with EXPLAIN ANALYZE again.
Part 8 done. Next up — Part 9: Vacuum & Autovacuum. Postgres never overwrites rows in place — MVCC means dead tuples accumulate. Ignore VACUUM and your database bloats, slows, and eventually hits transaction ID wraparound.