Postgres Internals 107: Transactions & Concurrency

March 29, 2026 · 12 min read · PART 14 / 19

In Part 6 we pushed write throughput as far as a single primary can go — batching, partitioning, UPSERT. But throughput is only half the story. When multiple transactions touch the same data simultaneously, you need guarantees about what each one sees. This is Part 7: how Postgres handles concurrency without letting reads block writes, what each isolation level actually protects against, and the production bugs that happen when you misunderstand locking.

Every statement is a transaction

This is the first thing to internalize: in Postgres, there is no such thing as a statement that runs outside a transaction. Every single statement — even a bare SELECT — runs inside one. If you don't explicitly start a transaction, Postgres wraps the statement in an implicit one and commits it immediately.

-- These two are equivalent:
SELECT * FROM orders WHERE id = 42;

BEGIN;
SELECT * FROM orders WHERE id = 42;
COMMIT;

The implication: autocommit mode (the default in most drivers) commits after every statement. This is usually what you want for standalone reads and writes, but it means that if you do two updates "as a unit" without an explicit BEGIN, they are not atomic. One can succeed while the other fails.

-- ❌ NOT atomic — if the UPDATE fails, the INSERT already committed
INSERT INTO orders (user_id, total) VALUES (42, 99.00);
UPDATE user_balance SET balance = balance - 99.00 WHERE user_id = 42;

-- ✅ Atomic — both succeed or both roll back
BEGIN;
INSERT INTO orders (user_id, total) VALUES (42, 99.00);
UPDATE user_balance SET balance = balance - 99.00 WHERE user_id = 42;
COMMIT;

ACID — what it actually means in Postgres

ACID is usually explained in textbook terms. Here's what each letter means in practice:

Atomicity: A transaction is all-or-nothing. If any statement inside a BEGIN...COMMIT block fails, Postgres rolls back everything — no partial state is ever visible to other transactions. A payment that deducts from one account must also credit the other; atomicity ensures you can't have the deduction without the credit.

Consistency: Postgres enforces constraints (NOT NULL, UNIQUE, FOREIGN KEY, CHECK), so you can never commit a state that breaks the schema's rules. Most constraints are checked statement by statement; constraints declared DEFERRABLE (UNIQUE, FOREIGN KEY, EXCLUDE) can be postponed to commit time, letting a transaction pass through temporarily inconsistent intermediate states. This is the database's promise that it won't let you put garbage in.

Isolation: Concurrent transactions behave as if they ran serially. In practice, Postgres allows tunable levels of isolation (discussed below) — full isolation is expensive, so you choose how much you need. This is where most production bugs live.

Durability: Once Postgres says COMMIT, the data is on disk. Postgres uses WAL (Write-Ahead Logging) — every change is written to a sequential log before it touches the heap. If the server crashes after a COMMIT, the WAL replay recovers the write. If it crashes before COMMIT, the write never happened.
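The log-first, heap-second ordering is the whole trick. A toy model (my own simplification — real WAL records are binary, page-oriented, and fsync'd to disk) makes the recovery rule concrete:

```python
# Toy sketch of write-ahead logging: every change is appended to a
# sequential log BEFORE the "heap" is touched, and crash recovery
# replays only records from transactions that reached COMMIT.

class ToyWAL:
    def __init__(self):
        self.log = []   # durable, sequential: ("data"|"commit", txid, key, value)
        self.heap = {}  # volatile data pages

    def write(self, txid, key, value):
        self.log.append(("data", txid, key, value))  # log record first...
        self.heap[key] = value                       # ...then apply to the heap

    def commit(self, txid):
        self.log.append(("commit", txid, None, None))

    def recover(self):
        """Crash recovery: rebuild the heap from the log, replaying only
        changes whose transaction reached its COMMIT record."""
        committed = {txid for kind, txid, _, _ in self.log if kind == "commit"}
        heap = {}
        for kind, txid, key, value in self.log:
            if kind == "data" and txid in committed:
                heap[key] = value
        return heap

wal = ToyWAL()
wal.write(100, "balance:42", 500)
wal.commit(100)                    # txn 100 reached COMMIT — must survive a crash
wal.write(101, "balance:42", 401)  # txn 101 never commits
recovered = wal.recover()
print(recovered)                   # {'balance:42': 500} — the uncommitted write is gone
```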

MVCC: why readers don't block writers

Most databases use a locking model where a write locks the row and readers must wait. Postgres uses MVCC — Multi-Version Concurrency Control. The core insight: instead of blocking readers, Postgres keeps multiple versions of each row simultaneously.

When you UPDATE a row, Postgres does not overwrite it. It writes a new version of the row (called a tuple) and marks the old version as "deleted as of transaction X." The old version stays in the heap until VACUUM cleans it up. Each transaction sees a snapshot of the database as it existed when the transaction started — old versions are visible to older transactions, new versions to newer ones.

Heap page for row id=42:

Tuple 1: user_id=42, balance=500.00  [xmin=100, xmax=205]  ← old version (deleted by txn 205)
Tuple 2: user_id=42, balance=401.00  [xmin=205, xmax=0]    ← current version (inserted by txn 205)

A transaction whose snapshot was taken at transaction ID 150 will still see balance=500.00 even after transaction 205 commits. A transaction whose snapshot was taken at ID 210 sees balance=401.00. Neither blocks the other. This is why Postgres can sustain high concurrent read/write workloads without readers and writers serializing on locks.

The tradeoff: dead tuples accumulate. VACUUM is the mechanism that reclaims them (covered in Part 9).
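Real visibility checks also consult commit status, in-progress transaction lists, and hint bits, but the core xmin/xmax rule from the heap-page diagram above can be sketched in a few lines (a simplified model of my own, assuming every transaction mentioned has committed):

```python
# Simplified MVCC visibility: a tuple version is visible to a snapshot if
# it was inserted at or before the snapshot's transaction ID and not yet
# deleted as of that snapshot (xmax == 0 means "never deleted").

def visible(tuple_version, snapshot_xid):
    xmin, xmax = tuple_version["xmin"], tuple_version["xmax"]
    inserted_before = xmin <= snapshot_xid
    deleted_before = xmax != 0 and xmax <= snapshot_xid
    return inserted_before and not deleted_before

# The two versions of row id=42 from the diagram:
old = {"balance": 500.00, "xmin": 100, "xmax": 205}
new = {"balance": 401.00, "xmin": 205, "xmax": 0}

def read_balance(snapshot_xid):
    # Scan versions and return the one this snapshot is allowed to see.
    for version in (new, old):
        if visible(version, snapshot_xid):
            return version["balance"]

print(read_balance(150))  # 500.0 — older snapshot still sees the old version
print(read_balance(210))  # 401.0 — newer snapshot sees the current version
```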

Isolation levels

MVCC gives you isolation, but how much? Postgres offers three isolation levels you'll actually use. Each one prevents a progressively stronger set of anomalies.

READ COMMITTED (default)

Each statement inside the transaction sees a fresh snapshot — committed data as of the moment that statement runs. Two reads of the same row within one transaction can see different values if another transaction commits between them.

-- Transaction A
BEGIN;
SELECT balance FROM accounts WHERE id = 1;  -- sees 500

-- Transaction B (commits in between)
UPDATE accounts SET balance = 300 WHERE id = 1;
COMMIT;

-- Transaction A — same transaction, new snapshot per statement
SELECT balance FROM accounts WHERE id = 1;  -- now sees 300
COMMIT;

This is a non-repeatable read. For most web app reads this is fine. For financial logic where you read a value and then act on it, it's a bug waiting to happen.

REPEATABLE READ

The snapshot is taken once — at the start of the transaction — and held for all statements. Reads within the same transaction always see the same data, regardless of concurrent commits.

SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN;
SELECT balance FROM accounts WHERE id = 1;  -- sees 500
-- Even if another transaction commits balance=300 here...
SELECT balance FROM accounts WHERE id = 1;  -- still sees 500
COMMIT;

This prevents non-repeatable reads. It also prevents phantom reads in Postgres (unlike the SQL standard, Postgres's REPEATABLE READ is stronger than spec). Use this when you need a consistent view across multiple queries in one logical operation — generating a report, running a multi-step calculation.

SERIALIZABLE

The strongest guarantee. Postgres tracks dependencies between transactions and ensures the final state is equivalent to some serial execution. It uses SSI (Serializable Snapshot Isolation), which monitors read/write dependencies rather than taking blocking locks. Transactions that would produce anomalies are aborted with a serialization failure error.

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN;
-- If Postgres detects that committing this transaction would produce
-- a result inconsistent with any serial ordering, it will abort:
-- ERROR: could not serialize access due to read/write dependencies among transactions

Serializable is for operations where correctness is non-negotiable and anomalies under REPEATABLE READ are possible — think double-spend prevention, seat reservation, inventory allocation. The cost: higher abort rate under contention. Your application must be prepared to retry serialization failures.
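The retry loop itself is generic. A sketch (the SerializationFailure class here is a stand-in for your driver's exception carrying SQLSTATE 40001, and the backoff numbers are illustrative, not tuned values):

```python
# Retry wrapper for SERIALIZABLE transactions: serialization failures
# (SQLSTATE 40001) are expected under contention and must be retried,
# with a little jittered backoff so colliding transactions de-synchronize.

import random
import time

class SerializationFailure(Exception):
    """Stand-in for a driver exception whose SQLSTATE is 40001."""
    sqlstate = "40001"

def run_serializable(txn_fn, max_attempts=5):
    """Run txn_fn, retrying on serialization failures; re-raise if exhausted."""
    for attempt in range(max_attempts):
        try:
            return txn_fn()
        except SerializationFailure:
            if attempt == max_attempts - 1:
                raise
            time.sleep(random.uniform(0, 0.01 * 2 ** attempt))  # backoff + jitter

# Demo: a transaction function that fails twice before succeeding.
attempts = {"n": 0}
def flaky_txn():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise SerializationFailure()
    return "committed"

print(run_serializable(flaky_txn))  # committed
```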

Anomaly summary

Isolation Level   | Dirty Read | Non-Repeatable | Phantom Read
------------------+------------+----------------+-------------
READ COMMITTED    | Prevented  | Possible       | Possible
REPEATABLE READ   | Prevented  | Prevented      | Prevented*
SERIALIZABLE      | Prevented  | Prevented      | Prevented
* Postgres's REPEATABLE READ prevents phantoms; SQL standard does not require this

Practical locking

MVCC handles read/write concurrency. But sometimes you need explicit locks — to prevent a row from being modified by another transaction while you're making a decision based on its value.

SELECT FOR UPDATE

The most common explicit lock. It reads a row AND locks it. Any other transaction trying to SELECT FOR UPDATE or modify the same row will block until you release the lock (commit or rollback). Plain SELECTs are unaffected — MVCC still serves them their snapshot.

-- Process a job — mark it as processing, preventing another worker from grabbing it
BEGIN;
SELECT id, payload
FROM job_queue
WHERE status = 'pending'
ORDER BY created_at
LIMIT 1
FOR UPDATE SKIP LOCKED;  -- skip rows already locked by other workers

UPDATE job_queue SET status = 'processing' WHERE id = $1;
COMMIT;

SKIP LOCKED is the key for job queues — instead of blocking on locked rows, skip them and find the next available one. Without it, all workers queue up on the same row.
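The effect is easy to see even without a database. In this in-memory analogy (my own — a Python lock stands in for row-lock acquisition, and a status field stands in for "locked by another worker"), ten workers each walk past already-claimed jobs to a free one instead of piling up on the first row:

```python
# Simulation of SKIP LOCKED semantics: each worker claims the first job
# nobody else holds, so all ten workers end up with distinct jobs.

import threading

jobs = [{"id": i, "status": "pending"} for i in range(10)]
claim_mutex = threading.Lock()
claimed = []

def claim_job():
    """Analogue of SELECT ... FOR UPDATE SKIP LOCKED LIMIT 1 + UPDATE."""
    with claim_mutex:  # stands in for row-lock acquisition
        for job in jobs:
            if job["status"] == "pending":   # SKIP LOCKED: pass over claimed rows
                job["status"] = "processing"
                return job["id"]
    return None  # queue drained

workers = [threading.Thread(target=lambda: claimed.append(claim_job()))
           for _ in range(10)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(sorted(claimed))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] — one distinct job each
```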

SELECT FOR SHARE

A lighter lock. Multiple transactions can hold a FOR SHARE lock on the same row simultaneously. It prevents updates but allows concurrent reads. Use when you're validating a row before inserting a dependent row (e.g., checking a parent record exists before inserting a child).

-- Validate user exists and lock it — prevents deletion while we insert the order
BEGIN;
SELECT id FROM users WHERE id = $1 FOR SHARE;
INSERT INTO orders (user_id, total) VALUES ($1, $2);
COMMIT;

Advisory locks

Application-level locks that have nothing to do with table rows. They're cooperative — Postgres just provides the locking primitive; your code decides what the lock "means."

-- Acquire a session-level advisory lock with id 12345
SELECT pg_advisory_lock(12345);

-- Do something exclusive...
UPDATE config SET value = 'new' WHERE key = 'feature_flag';

-- Release
SELECT pg_advisory_unlock(12345);

-- Or: transaction-level (auto-released on commit/rollback)
SELECT pg_advisory_xact_lock(12345);

Advisory locks are great for distributed "only one process should do X at a time" problems — cron jobs, migrations, leader election in a multi-process app. They don't interfere with table-level MVCC at all.
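One practical wrinkle: pg_advisory_lock takes a bigint key, but lock names in application code are usually strings. A common pattern is to hash the name down to a stable 64-bit integer — the helper below is my own sketch, not a Postgres or driver API:

```python
# Map a human-readable lock name to a stable signed 64-bit key suitable
# for pg_advisory_lock()/pg_advisory_xact_lock(). Truncated SHA-256 keeps
# the mapping deterministic across processes, machines, and restarts.

import hashlib

def advisory_key(name: str) -> int:
    """Derive a signed 64-bit advisory-lock key from a lock name."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)  # fits bigint

key = advisory_key("nightly-report-job")
print(-2**63 <= key < 2**63)  # True — within Postgres's bigint range
# Usage with a DB-API-style cursor (assumed interface):
#   cur.execute("SELECT pg_advisory_xact_lock(%s)", (key,))
```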

Table-level locks

Usually acquired implicitly. ALTER TABLE takes an ACCESS EXCLUSIVE lock — the heaviest lock, blocks everything. INSERT/UPDATE/DELETE take ROW EXCLUSIVE. A plain SELECT takes only ACCESS SHARE, which conflicts with nothing except ACCESS EXCLUSIVE — this is why reads never wait on ordinary writes. Explicit table locks are rare in application code but common in migrations:

-- Explicit table lock (very rarely needed in application code)
BEGIN;
LOCK TABLE accounts IN SHARE ROW EXCLUSIVE MODE;
-- Now safe to do bulk updates without other transactions modifying the table
COMMIT;

Deadlocks

A deadlock happens when two transactions each hold a lock that the other wants. Neither can proceed. Postgres detects this automatically — once a transaction has waited on a lock for longer than deadlock_timeout (1 second by default), it checks the wait-for graph for a cycle — and terminates one transaction with:

ERROR:  deadlock detected
DETAIL:  Process 12345 waits for ShareLock on transaction 67890;
         blocked by process 67890.
         Process 67890 waits for ShareLock on transaction 12345;
         blocked by process 12345.
HINT:   See server log for query details.

The classic scenario:

-- Transaction A                    -- Transaction B
BEGIN;                               BEGIN;
UPDATE accounts SET ...              UPDATE accounts SET ...
  WHERE id = 1;  -- locks row 1       WHERE id = 2;  -- locks row 2

UPDATE accounts SET ...              UPDATE accounts SET ...
  WHERE id = 2;  -- waits for B        WHERE id = 1;  -- waits for A
                                     -- DEADLOCK

How to prevent deadlocks

Always acquire locks in the same order. This is the golden rule. If every transaction that touches multiple rows sorts them by primary key and acquires locks in that order, cycles are impossible.

# ❌ Deadlock-prone — different order per transaction
async def transfer(from_id, to_id, amount):
    await conn.execute("UPDATE accounts SET balance = balance - $1 WHERE id = $2", amount, from_id)
    await conn.execute("UPDATE accounts SET balance = balance + $1 WHERE id = $2", amount, to_id)

# ✅ Deadlock-safe — always lock lower id first
async def transfer(from_id, to_id, amount):
    first_id, second_id = sorted([from_id, to_id])
    # acquire in consistent order regardless of transfer direction
    await conn.execute("SELECT id FROM accounts WHERE id = ANY($1) FOR UPDATE",
                       [first_id, second_id])
    await conn.execute("UPDATE accounts SET balance = balance - $1 WHERE id = $2", amount, from_id)
    await conn.execute("UPDATE accounts SET balance = balance + $1 WHERE id = $2", amount, to_id)

SAVEPOINT: partial rollbacks

Sometimes you want to handle an error in one part of a transaction without rolling back everything. SAVEPOINT creates a rollback point within a transaction.

BEGIN;

INSERT INTO orders (user_id, total) VALUES (42, 99.00);
SAVEPOINT after_order;

-- Try to apply a discount
UPDATE promotions SET used = true WHERE code = 'SAVE10' AND user_id = 42;

-- If that fails or finds no rows, roll back just to the savepoint
ROLLBACK TO SAVEPOINT after_order;

-- The order insert is still in effect — only the promotion update was rolled back
COMMIT;

SAVEPOINTs are useful in application code when wrapping multiple operations where some are optional or "best effort" — insert the core record, try to update related data, roll back to the savepoint if the secondary update fails, and commit the primary record regardless.
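That pattern can be wrapped once and reused. Here's a sketch of a "best effort" context manager — the cursor interface is assumed to be a DB-API-style execute(), and the StubCursor exists only to demonstrate the control flow without a database:

```python
# A context manager that runs its body inside a SAVEPOINT: on success the
# savepoint is released; on any error we roll back to it and swallow the
# exception, leaving the outer transaction intact.

from contextlib import contextmanager
import itertools

_sp_counter = itertools.count()

@contextmanager
def best_effort(cur):
    name = f"sp_{next(_sp_counter)}"
    cur.execute(f"SAVEPOINT {name}")
    try:
        yield
        cur.execute(f"RELEASE SAVEPOINT {name}")
    except Exception:
        cur.execute(f"ROLLBACK TO SAVEPOINT {name}")  # suppress, keep outer txn

# Stub cursor: records SQL, fails on the promotion update (hypothetical tables).
class StubCursor:
    def __init__(self):
        self.sql = []
    def execute(self, query):
        self.sql.append(query)
        if "promotions" in query:
            raise RuntimeError("no rows updated")

cur = StubCursor()
cur.execute("INSERT INTO orders (user_id, total) VALUES (42, 99.00)")
with best_effort(cur):
    cur.execute("UPDATE promotions SET used = true WHERE code = 'SAVE10'")
print(cur.sql[-1])  # ROLLBACK TO SAVEPOINT sp_0 — the order insert still stands
```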

Production bugs

Long-running transactions holding locks

The most dangerous production pattern. A transaction that runs for minutes (slow query, stuck application, forgotten connection) holds row locks. Every other transaction trying to touch those rows queues up. Your application appears to hang — connection pool exhausts, timeouts cascade.

-- Find long-running transactions (> 5 minutes)
SELECT
  pid,
  now() - pg_stat_activity.xact_start AS duration,
  query,
  state
FROM pg_stat_activity
WHERE (now() - xact_start) > interval '5 minutes'
  AND state != 'idle'
ORDER BY duration DESC;

-- Kill one if needed (substitute the pid from the query above)
SELECT pg_terminate_backend(12345);

Forgetting to commit

In interactive sessions or poorly structured application code, transactions that are opened but never committed hold resources and locks indefinitely. Set idle_in_transaction_session_timeout to automatically kill sessions that have been idle inside a transaction too long:

-- Kill sessions idle in a transaction for more than 30 seconds
ALTER SYSTEM SET idle_in_transaction_session_timeout = '30s';
SELECT pg_reload_conf();

Lock contention under high write load

When many transactions update the same rows, they serialize on row locks even with MVCC. Symptoms: high wait times, pg_stat_activity showing many transactions in lock wait state, throughput plateau despite low CPU.

-- See what's blocked and what's blocking
SELECT
  blocked.pid AS blocked_pid,
  blocked.query AS blocked_query,
  blocking.pid AS blocking_pid,
  blocking.query AS blocking_query
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE blocked.wait_event_type = 'Lock';

Solutions: shorter transactions, lock ordering, optimistic concurrency (version columns + retry on conflict), or restructuring to reduce contention on hot rows.
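Of those, optimistic concurrency deserves a sketch: add a version column and make every UPDATE conditional on the version you read — zero rows updated means someone got there first, and the caller re-reads and retries. In SQL this is `UPDATE ... SET ..., version = version + 1 WHERE id = $1 AND version = $2`. An in-memory model of the idea (table and column names are illustrative):

```python
# Optimistic concurrency with a version column: a write applies only if
# the row's version is still the one the writer read; a stale version
# means a concurrent commit happened, and the caller must retry.

rows = {1: {"balance": 500, "version": 7}}  # stand-in for the accounts table

def update_if_unchanged(row_id, expected_version, new_balance):
    row = rows[row_id]
    if row["version"] != expected_version:  # someone committed in between
        return False                         # caller re-reads and retries
    row["balance"] = new_balance
    row["version"] += 1
    return True

# Two clients both read version 7; only the first write wins.
print(update_if_unchanged(1, 7, 400))  # True  — applied, version is now 8
print(update_if_unchanged(1, 7, 450))  # False — stale version, retry needed
```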

Quiz

Q1. Two transactions run at READ COMMITTED. Both read stock = 5 for item 99, then both write stock = stock - 1. What's the final stock value?

It depends on how the write is phrased. If both transactions issue UPDATE items SET stock = stock - 1 WHERE id = 99, the final value is 3: the second UPDATE blocks on the row lock, and when the first commits, READ COMMITTED re-evaluates the row against the newly committed version (stock = 4) before decrementing it. But if the application reads 5, computes 4 in code, and issues SET stock = 4, both transactions write 4 and the final value is 4 — a classic lost update; one decrement vanished. Fix: lock the row first with SELECT FOR UPDATE, or keep the arithmetic inside the UPDATE so it executes against the committed value under the row lock.

Q2. When does Postgres remove old row versions created by MVCC?

VACUUM (manual or autovacuum) reclaims dead tuples. A tuple is "dead" when no running transaction has a snapshot old enough to need to see it. Postgres tracks the oldest active transaction ID (OldestXmin). Any tuple whose xmax is older than OldestXmin is safe to reclaim.

This is why long-running transactions are dangerous beyond just holding locks — they prevent VACUUM from reclaiming dead tuples, causing table bloat even if they're just idle reads.

Q3. A job queue uses SELECT FOR UPDATE without SKIP LOCKED. You scale to 10 workers. What happens?

All 10 workers queue up on the same row. Each worker runs SELECT ... FOR UPDATE LIMIT 1. Worker 1 acquires the lock. Workers 2–10 all block, waiting for worker 1 to release it. When worker 1 commits, worker 2 acquires the lock — on the same row, now in processing state, so it gets no work. Then worker 3 does the same. You've serialized 10 workers onto one row.

With SKIP LOCKED, each worker skips rows locked by others and immediately grabs the next available row. All 10 workers proceed in parallel.

Q4. What's the difference between REPEATABLE READ and SERIALIZABLE in Postgres?

Both use a single snapshot held for the whole transaction. The difference is in what anomalies they prevent beyond non-repeatable reads and phantoms.

REPEATABLE READ can still have write skew anomalies. Example: two doctors both check "is there at least one doctor on call?" (both see yes), both decide to go off-call. Now there are zero doctors on call. Each transaction read a consistent snapshot, but the combined effect is incorrect.

SERIALIZABLE detects this dependency cycle and aborts one transaction. The application must retry. Serializable prevents all anomalies but has higher abort rates under contention.

Part 7 done. Next up — Part 8: EXPLAIN & Query Planning. When a query is slow, guessing is expensive. EXPLAIN ANALYZE tells you exactly what Postgres is doing, what it expected, and where it went wrong.
