Postgres Internals 108: EXPLAIN & Query Planning
In Part 7 we covered how Postgres
manages concurrent access through MVCC and locking. Now for the performance side: why is that query slow?
Postgres has a built-in tool that answers this question with surgical precision — EXPLAIN ANALYZE.
This is Part 8: how the query planner works, how to read a query plan, and the patterns that reliably
turn slow queries fast.
What the query planner does
When you run a query, Postgres doesn't execute it immediately. First, it hands the parsed query to the query planner. The planner's job: find the cheapest way to execute the query.
"Cheapest" means estimated total I/O and CPU cost. The planner generates multiple candidate plans
(different join orders, different scan strategies), estimates the cost of each using column statistics
stored in pg_statistic, and picks the plan with the lowest estimated cost.
SQL Query
↓
Parser (syntax tree)
↓
Rewriter (apply rules, views)
↓
Planner (generate + cost candidate plans, pick winner)
↓
Executor (run the plan)
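To make "cost" concrete: for a sequential scan, the estimate is roughly pages × seq_page_cost + rows × cpu_tuple_cost (plus per-row costs for any filter). The numbers below are illustrative, but the cost parameters are real settings you can inspect with SHOW seq_page_cost;

-- 10,000-page, 500,000-row table, default seq_page_cost = 1.0, cpu_tuple_cost = 0.01
-- cost ≈ 10000 × 1.0 + 500000 × 0.01 = 15000

Index scans are additionally charged random_page_cost (default 4.0) per heap page they might touch, which is why they lose to sequential scans once many rows match.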
The planner is not magic. It estimates based on statistics, and statistics can be wrong. A bad estimate leads to a bad plan — this is the root cause of most "why is this query suddenly slow" incidents.
EXPLAIN vs EXPLAIN ANALYZE
EXPLAIN shows you the plan the planner chose — estimated costs and row counts,
but it does not run the query.
EXPLAIN ANALYZE actually executes the query and shows both estimated and actual numbers.
This is what you use when debugging: the gap between estimated and actual rows is where bad plans hide.
-- Shows plan only — safe to run on production queries without side effects
EXPLAIN SELECT * FROM orders WHERE user_id = 42;
-- Executes the query AND shows actual timing and row counts
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 42;
-- Most useful flags for debugging
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders WHERE user_id = 42;
BUFFERS shows cache hits vs disk reads — critical for understanding whether slowness
is I/O-bound or CPU-bound. If you see high Buffers: shared read numbers, your working
set doesn't fit in shared_buffers and you're hitting disk.
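Here's an illustrative example of the extra line BUFFERS adds to each node (numbers invented for illustration):

Seq Scan on orders (cost=0.00..18340.00 rows=450000 width=128)
  (actual time=0.015..180.322 rows=450000 loops=1)
  Buffers: shared hit=2048 read=6292

shared hit counts pages found in shared_buffers; shared read counts pages fetched from the OS (disk or OS page cache). A consistently high read count on a hot query means the working set doesn't fit in memory.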
Warning: EXPLAIN ANALYZE runs the query. For DELETEs or UPDATEs,
wrap in a transaction and roll back:
BEGIN;
EXPLAIN ANALYZE DELETE FROM logs WHERE created_at < '2025-01-01';
ROLLBACK;
Reading a query plan
A query plan is a tree. Each node is an operation. The innermost nodes run first; results flow upward to the root. Each node shows:
Node Type (cost=startup..total rows=estimated_rows width=row_bytes)
(actual time=startup..total rows=actual_rows loops=N)
Seq scan
A full table scan. Reads every row, filters as it goes. Fast when you need most of the table. Slow and alarming when you expected an index scan.
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 42;
-- Output:
Seq Scan on orders (cost=0.00..18340.00 rows=1 width=128)
(actual time=2.453..234.231 rows=1 loops=1)
Filter: (user_id = 42)
Rows Removed by Filter: 450000
450,000 rows scanned to return 1. There's no index on user_id, or the planner
decided the index wasn't worth using (see: low selectivity).
Index scan
Uses a B-tree index to find matching rows, then fetches them from the heap. Fast for selective queries. Each heap fetch is potentially a random I/O — every matched row can live on a different page — which is why the planner abandons index scans when many rows match.
-- With an index on user_id:
Index Scan using orders_user_id_idx on orders (cost=0.43..8.45 rows=1 width=128)
(actual time=0.042..0.043 rows=1 loops=1)
Index Cond: (user_id = 42)
Index only scan
All columns needed are in the index — no heap access required. The fastest scan type. Requires every column in SELECT and WHERE to be in the index, and depends on the visibility map: rows on pages not yet marked all-visible still need a heap check (reported as Heap Fetches), so a freshly modified table won't show Heap Fetches: 0 until VACUUM runs.
-- If index is on (user_id) and we only SELECT user_id:
Index Only Scan using orders_user_id_idx on orders
(cost=0.43..4.45 rows=1 width=8)
Heap Fetches: 0 ← no heap I/O
Bitmap heap scan
Two-phase scan. First, a Bitmap Index Scan collects all matching TIDs (row pointers) from the index into a bitmap. Then the Bitmap Heap Scan reads heap pages in physical order (reducing random I/O). Appears when a query matches many rows — more efficient than individual Index Scans at medium selectivity.
Bitmap Heap Scan on orders (cost=12.34..180.23 rows=450 width=128)
Recheck Cond: (status = 'pending')
-> Bitmap Index Scan on orders_status_idx
Index Cond: (status = 'pending')
Hash join
For joining two tables. Builds a hash table from the smaller relation, then probes it with the
larger. Fast when one side fits in work_mem. If it spills to disk, you'll see
"Batches: N" with N > 1 — increase work_mem for that session.
Hash Join (cost=1234.56..5678.90 rows=450 width=256)
Hash Cond: (orders.user_id = users.id)
-> Seq Scan on orders
-> Hash
Buckets: 1024 Batches: 1 Memory Usage: 48kB ← fits in memory
-> Seq Scan on users
Nested loop
For each row in the outer relation, scan the inner relation. Efficient when the outer set is small and the inner has an index. Terrible when both sides are large — O(n×m) complexity.
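A typical Nested Loop shape (illustrative numbers): the inner Index Scan runs once per outer row, so its loops count equals the outer row count.

Nested Loop (cost=0.43..512.30 rows=150 width=256)
  -> Seq Scan on users (actual time=0.010..1.200 rows=50 loops=1)
  -> Index Scan using orders_user_id_idx on orders
       (actual time=0.020..0.030 rows=3 loops=50)
       Index Cond: (user_id = users.id)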
Merge join
Requires both sides to be sorted on the join key. Works well when both sides are already sorted (index scans in order) or when sorting them is cheap. Uses a linear merge — O(n+m).
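An illustrative Merge Join plan — one side comes pre-sorted from an index, the other needs an explicit Sort:

Merge Join (cost=1500.00..3200.00 rows=45000 width=256)
  Merge Cond: (orders.user_id = users.id)
  -> Index Scan using orders_user_id_idx on orders
  -> Sort
       Sort Key: users.id
       -> Seq Scan on users

An explicit Sort feeding a Merge Join is a hint that an index on the sort key could eliminate the sort entirely.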
The most important numbers
Rows: estimated vs actual
This is the single most diagnostic number in a query plan. When estimated rows differ wildly from actual rows, the planner made a bad decision based on bad statistics.
-- Estimated: 1 row. Actual: 45,000 rows.
-- The planner expected a tiny result and chose a Nested Loop.
-- With 45,000 rows, a Hash Join would have been 100x faster.
Nested Loop (cost=0.43..890.12 rows=1 width=256)
(actual time=0.05..8234.12 rows=45000 loops=1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ← catastrophic estimate miss
Actual time
Format: actual time=startup_ms..total_ms. startup is time to return
the first row; total is time to return all rows. The root node's total time is your
query's wall clock time.
Loops
In a Nested Loop, inner nodes execute once per outer row. A node showing loops=450
ran 450 times. The actual time shown is per loop — multiply by loops for total time. A slow inner
node in a loop is magnified.
-- This looks fast — 2ms
Index Scan ... (actual time=0.01..2.00 rows=10 loops=450)
-- But it ran 450 times — total: 900ms just for this node
-- Total rows: 450 × 10 = 4,500
pg_stats and ANALYZE
The planner's estimates come from statistics exposed in the pg_stats view (stored in pg_statistic). Postgres collects them by sampling each table — 300 × default_statistics_target rows. The target defaults to 100 and also caps how many most-common values and histogram buckets are kept per column.
-- See what statistics Postgres has for a column
SELECT
attname,
n_distinct,
correlation,
most_common_vals,
most_common_freqs
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';
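Illustrative output for a skewed status column, and how to read it:

 attname | n_distinct | correlation | most_common_vals                      | most_common_freqs
---------+------------+-------------+---------------------------------------+----------------------
 status  |          4 |        0.21 | {completed,pending,shipped,cancelled} | {0.80,0.12,0.05,0.03}

n_distinct is the estimated number of distinct values (negative means a fraction of the row count, e.g. -1 for a unique column). correlation measures how closely physical row order follows the column's sort order — it feeds the index-vs-bitmap-scan decision. most_common_vals/most_common_freqs are what tell the planner that status = 'completed' matches about 80% of the table.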
ANALYZE refreshes these statistics. Autovacuum runs it automatically, but if you just
loaded a million rows or your data distribution changed significantly, run it manually:
ANALYZE orders; -- update stats for the whole table
ANALYZE orders (user_id); -- update stats for one column only
For skewed data — columns where a few values appear very frequently — increase the statistics target for that column to give the planner more detail:
-- Raise the statistics target for the status column (default is 100)
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000;
ANALYZE orders;
Common bad plans and how to fix them
Sequential scan on an indexed column
The planner chose a Seq Scan despite an index existing. Two common reasons:
1. Low selectivity. If status = 'active' matches 80% of the table,
using the index is slower than a sequential scan. The planner is correct to ignore it. Solution:
partial indexes on the minority value.
-- Index only the rare status values
CREATE INDEX orders_status_pending_idx ON orders (status)
WHERE status IN ('pending', 'processing');
2. Stale statistics. The planner thinks the column has low selectivity because
stats haven't been updated since a data shape change. Run ANALYZE.
Implicit type cast killing index use
This trap is subtler than it looks. A bare quoted literal like '42' is an untyped constant, so WHERE user_id = '42' still resolves to an integer comparison and uses the index. The index breaks when the other side has a concrete non-matching type — a numeric literal, or a driver binding the wrong parameter type — because Postgres casts the column, and a cast column no longer matches the index expression.
-- user_id is INT; comparing against a numeric casts the column
SELECT * FROM orders WHERE user_id = 42.0; -- ❌ implicit cast on column
-- Correct: pass the right type
SELECT * FROM orders WHERE user_id = 42; -- ✅ index used
-- Or: check your ORM isn't binding mismatched types for integer columns
-- In psycopg2: cursor.execute("... WHERE user_id = %s", [42]) -- 42 as int, not Decimal or str
Bad join order
The planner joins tables in the order it estimates produces the smallest intermediate result first. If statistics are wrong, it might join a large table first, creating a huge intermediate. You can force join order as a diagnostic:
SET join_collapse_limit = 1; -- disables join reordering, uses query order
EXPLAIN ANALYZE
SELECT * FROM orders o
JOIN users u ON u.id = o.user_id
JOIN products p ON p.id = o.product_id
WHERE o.status = 'pending';
Debugging with enable_seqscan = off
When you're trying to understand why Postgres isn't using an index you expect it to, turn off sequential scans for the session. This forces the planner to use indexes — helpful for seeing what plan it would generate if it did:
SET enable_seqscan = off;
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 42;
-- Now see what the index scan plan costs vs what the seq scan would have been
SET enable_seqscan = on; -- always restore
Never leave this set in production. Use it only as a diagnostic in a session.
Finding slow queries in production
pg_stat_statements
The most important extension for production Postgres. It tracks execution statistics for every query pattern across the whole database.
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Top 10 queries by total time
SELECT
query,
calls,
total_exec_time / 1000 AS total_sec,
mean_exec_time AS mean_ms,
rows,
stddev_exec_time AS stddev_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
-- Queries with worst average latency (min 100 calls)
SELECT
query,
calls,
mean_exec_time AS mean_ms,
max_exec_time AS max_ms
FROM pg_stat_statements
WHERE calls > 100
ORDER BY mean_exec_time DESC
LIMIT 10;
The output normalizes query parameters — WHERE id = $1 — so all calls to the same
query pattern are aggregated, regardless of the parameter value. This is what makes it useful
for spotting the one query that runs 50,000 times a day with a 40ms average.
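For example, these two executions:

SELECT * FROM orders WHERE user_id = 42;
SELECT * FROM orders WHERE user_id = 97;

both count toward a single pg_stat_statements row whose query column reads:

SELECT * FROM orders WHERE user_id = $1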
auto_explain for logging slow query plans
auto_explain logs the full EXPLAIN ANALYZE output for any query that exceeds a time
threshold — without any application changes. Invaluable for catching slow queries in production
without manually adding EXPLAIN everywhere.
-- Enable auto_explain (in postgresql.conf or per-session)
LOAD 'auto_explain';
SET auto_explain.log_min_duration = 1000; -- log plans for queries > 1 second
SET auto_explain.log_analyze = true;
SET auto_explain.log_buffers = true;
-- The next slow query will write its full plan to the Postgres log
-- Check: tail -f /var/log/postgresql/postgresql.log
To enable globally without a restart:
ALTER SYSTEM SET session_preload_libraries = 'auto_explain';
ALTER SYSTEM SET auto_explain.log_min_duration = '1s';
SELECT pg_reload_conf();
Quiz
Q1. EXPLAIN shows rows=1 estimated but EXPLAIN ANALYZE shows rows=45000 actual. What's the most likely cause and fix?
Stale or insufficient statistics. The planner sampled the table when the column had a very different distribution — perhaps when the table had far fewer rows, or the column's value distribution changed. With only 1 row expected, the planner probably chose a Nested Loop join or an Index Scan that performs poorly at 45,000 rows.
Fix: Run ANALYZE tablename to refresh statistics. If the distribution is highly skewed (a few values are very common), raise the statistics target for that column: ALTER TABLE t ALTER COLUMN c SET STATISTICS 500; ANALYZE t;
Q2. You add an index on email VARCHAR but queries with WHERE LOWER(email) = 'user@example.com' still do Seq Scans. Why?
The index is on email, not on LOWER(email). When you apply a function to an indexed column in the WHERE clause, Postgres can't use the standard index — it would need to apply LOWER() to every indexed value to compare, defeating the index.
Fix: Create an expression index: CREATE INDEX email_lower_idx ON users (LOWER(email)); Now queries with WHERE LOWER(email) = ... will use this index directly.
Q3. A Hash Join node shows Batches: 8. What does this mean and how do you fix it?
The hash table spilled to disk. When the hash table doesn't fit in work_mem, Postgres partitions both sides of the join into batches and processes them one at a time, writing all but the current batch to temporary files. Batches: 8 means the join data was split into 8 partitions, written out and read back in — much slower than a single in-memory pass.
Fix: Increase work_mem for the session running the query: SET work_mem = '256MB'; EXPLAIN ANALYZE .... Check if Batches drops to 1. Don't increase it globally unless you're sure about memory impact — work_mem applies per sort/hash operation, not per connection, and one complex query can use it multiple times.
Q4. What's the difference between pg_stat_statements and auto_explain? When do you use each?
pg_stat_statements: Aggregates query statistics over time. You use it for identifying which queries are consuming the most resources — total time, call count, average latency. It doesn't show you the query plan, just the metrics. Use it first to find the culprits.
auto_explain: Logs the full EXPLAIN ANALYZE plan for individual query executions that exceed a time threshold. You use it after you've identified a slow query pattern to see exactly what plan it ran and where the time was spent. The output is one plan per execution, written to the server log.
Typical workflow: pg_stat_statements to find slow query patterns → auto_explain (or manual EXPLAIN ANALYZE) to diagnose the plan → fix the index/statistics/query → verify with EXPLAIN ANALYZE again.
Part 8 done. Next up — Part 9: Vacuum & Autovacuum. Postgres never overwrites rows in place — MVCC means dead tuples accumulate. Ignore VACUUM and your database bloats, slows, and eventually hits transaction ID wraparound.