Staff Prep 04: Scaling Postgres Reads — Replicas, Pooling & Caching

April 4, 2026 · 9 min read · PART 06 / 19

Back to Part 03: Indexing Strategy.

Your Postgres primary is at 80% CPU. You have good indexes. Queries are optimised. The problem is pure volume: too many reads. The solution stack is read replicas for query offload, connection pooling to prevent connection exhaustion, and application-level caching for the hottest paths. Let's build through each layer.

Read replicas: the basics and the gotchas

A read replica is a streaming copy of the primary. Postgres ships WAL (Write-Ahead Log) records to replicas in near-real-time. Replicas apply these records and stay within milliseconds of the primary under normal conditions.

What you can route to a replica:

  • Reporting queries, analytics, aggregations
  • Background job reads that can tolerate slight staleness
  • Paginated list queries for admin dashboards
  • Full-text search queries

What you cannot route to a replica:

  • Anything that must read data immediately after writing it (read-your-writes consistency)
  • Financial calculations requiring exact current state
  • Anything that informs a write decision on the primary

```python
from sqlalchemy.ext.asyncio import create_async_engine, AsyncSession
from sqlalchemy.orm import sessionmaker

# Two engine pools: primary for writes, replica for reads
primary_engine = create_async_engine(
    "postgresql+asyncpg://user:pass@primary-db:5432/mydb",
    pool_size=20,
    max_overflow=10,
)

replica_engine = create_async_engine(
    "postgresql+asyncpg://user:pass@replica-db:5432/mydb",
    pool_size=30,    # replicas can handle more connections
    max_overflow=20,
)

PrimarySession = sessionmaker(primary_engine, class_=AsyncSession, expire_on_commit=False)
ReplicaSession = sessionmaker(replica_engine, class_=AsyncSession, expire_on_commit=False)

# FastAPI dependency injection
async def get_db_write():
    async with PrimarySession() as session:
        yield session

async def get_db_read():
    async with ReplicaSession() as session:
        yield session
```
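For the read-your-writes cases above, one common pattern is to pin a user's reads to the primary for a short window after any write. A minimal in-process sketch, assuming a `StickyRouter` helper and a 5-second pin window (both are illustrative, not part of the series' code):

```python
import time

class StickyRouter:
    """Route a user's reads to the primary for a short window after they
    write, so they never see a replica that hasn't replayed their change."""

    def __init__(self, pin_seconds: float = 5.0):
        self.pin_seconds = pin_seconds
        self._last_write: dict[str, float] = {}

    def record_write(self, user_id: str) -> None:
        # Call this from the write path, after the commit succeeds.
        self._last_write[user_id] = time.monotonic()

    def target_for_read(self, user_id: str) -> str:
        pinned_at = self._last_write.get(user_id)
        if pinned_at is not None and time.monotonic() - pinned_at < self.pin_seconds:
            return "primary"
        return "replica"

router = StickyRouter(pin_seconds=5.0)
router.record_write("user-42")
print(router.target_for_read("user-42"))  # pinned: just wrote
print(router.target_for_read("user-99"))  # never wrote: replica is fine
```

In a real deployment the pin map would live in a shared store such as Redis, since requests from the same user land on different app servers.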

Replication lag: measuring and handling it

```sql
-- On the primary: check replication lag per replica
SELECT
    client_addr,
    state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    (sent_lsn - replay_lsn) AS replication_lag_bytes,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication;

-- On the replica: check how far behind it is
SELECT
    NOW() - pg_last_xact_replay_timestamp() AS replication_delay,
    pg_is_in_recovery() AS is_replica;
```

If replication_delay is consistently above 1–2 seconds, the replica is falling behind. Common causes: a CPU-bound replica, network latency, or a very high WAL generation rate on the primary. One caveat: pg_last_xact_replay_timestamp() only advances when transactions are replayed, so the measured delay also grows on an idle primary with no real lag. At sustained high lag, consider temporarily routing more critical reads back to the primary.

Connection pooling: why it matters

Postgres allocates a separate OS process per connection. Each process uses ~5–10 MB of RAM. At 500 connections: 2.5–5 GB RAM just for connection processes, before your data even gets cached. More critically, Postgres throughput does not scale linearly with connections — past ~100 active connections, throughput can actually decrease due to lock contention on internal structures.

The solution: a connection pool that multiplexes many application connections onto a smaller number of Postgres server connections. PgBouncer is the standard.

PgBouncer modes: the one decision that matters most

PgBouncer has three pooling modes. Choosing wrong is a production incident waiting to happen.

Session mode: One server connection per client connection, held for the entire client session. Essentially no benefit over no pooler — same connection count, just adds a hop. Use only if you need PgBouncer for other reasons (auth, load distribution).

Transaction mode: A server connection is assigned when a transaction starts and released when it commits or rolls back. The right choice for most applications. 100 app connections can share 20 server connections. Major caveat: session state does not survive across transactions, so prepared statements, session-level advisory locks, LISTEN/NOTIFY, and session-level SET are broken in transaction mode (SET LOCAL is transaction-scoped and therefore safe). Use bind parameters instead of prepared statements, or rely on PgBouncer 1.21+, which can track protocol-level prepared statements via max_prepared_statements.

Statement mode: A server connection is released after every single statement. Maximum multiplexing. Problem: multi-statement transactions are impossible. Rarely usable in practice.

```ini
; pgbouncer.ini — transaction mode configuration
[databases]
mydb = host=primary-db port=5432 dbname=mydb

[pgbouncer]
pool_mode = transaction          ; transaction mode for best multiplexing
max_client_conn = 1000           ; app can open up to 1000 connections to pgbouncer
default_pool_size = 25           ; only 25 actual Postgres server connections
min_pool_size = 5
reserve_pool_size = 5
reserve_pool_timeout = 3

; Authentication
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt

; Keepalive
server_idle_timeout = 600
client_idle_timeout = 0
```

Connection pool sizing math

The classic formula from the HikariCP team, widely cited in the Postgres community:

pool_size = (num_cores * 2) + effective_spindle_count

For a 4-core Postgres server with SSDs (effective_spindle_count = 1): pool_size = (4 * 2) + 1 = 9. That seems low, but it is often right for CPU-bound workloads. For I/O-bound workloads with slow disks, you can go higher. Most production Postgres deployments perform optimally at 20–50 server connections, not 200.
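As a sanity check, the formula in code, using the core and spindle counts from the examples in this section:

```python
def pool_size(num_cores: int, effective_spindle_count: int) -> int:
    # pool_size = (num_cores * 2) + effective_spindle_count
    return num_cores * 2 + effective_spindle_count

print(pool_size(4, 1))   # the 4-core SSD example above: 9
print(pool_size(16, 1))  # the 16-core NVMe case from the quiz: 33
```

Treat the result as a starting point, not a law: raise it for I/O-bound workloads and validate with load tests either way.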

```python
from sqlalchemy.ext.asyncio import create_async_engine

# Connect through PgBouncer in transaction mode.
# The prepared-statement cache is disabled via the URL parameter
# prepared_statement_cache_size=0 (a SQLAlchemy asyncpg-dialect option);
# cached prepared statements break under transaction pooling.
engine = create_async_engine(
    "postgresql+asyncpg://user:pass@pgbouncer:6432/mydb"
    "?prepared_statement_cache_size=0",
    pool_size=10,         # SQLAlchemy pool (connections to PgBouncer)
    max_overflow=20,
    pool_pre_ping=True,   # verify connections before use
    pool_recycle=3600,    # recycle connections hourly (avoid stale ones)
)
```

Query result caching: the last mile

Even with replicas and PgBouncer, some queries are too frequent or too expensive to hit the database every time. Application-level caching (Redis) is the final layer.

```python
import hashlib
import json

import redis.asyncio as redis
from sqlalchemy import text

r = redis.Redis(host="redis", port=6379, decode_responses=True)

def cache_key(sql: str, params: dict) -> str:
    payload = json.dumps({"sql": sql, "params": params}, sort_keys=True)
    return "qcache:" + hashlib.md5(payload.encode()).hexdigest()

async def cached_query(session, sql: str, params: dict, ttl: int = 60):
    key = cache_key(sql, params)

    cached = await r.get(key)
    if cached:
        return json.loads(cached)

    # SQLAlchemy 2.x requires string SQL to be wrapped in text();
    # .mappings() yields dict-like rows that serialize cleanly.
    result = await session.execute(text(sql), params)
    rows = [dict(row) for row in result.mappings().all()]

    await r.setex(key, ttl, json.dumps(rows))
    return rows
```
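The missing piece is invalidation: with per-key TTLs, a write can serve stale data for up to `ttl` seconds. One low-effort alternative is namespace versioning: embed a per-table version number in every cache key and bump it on writes, so every key for that table misses at once and the old entries simply age out. A sketch with a plain dict standing in for Redis (the `version_store`, `versioned_key`, and `invalidate_table` names are illustrative; in production the version would be an INCR-ed Redis key):

```python
import hashlib
import json

# Stand-in for Redis; in production these would be GET/INCR on Redis keys.
version_store: dict[str, int] = {}

def table_version(table: str) -> int:
    return version_store.get(table, 0)

def invalidate_table(table: str) -> None:
    """Bump the version after any write: old keys for this table become
    unreachable immediately and expire via their TTLs."""
    version_store[table] = table_version(table) + 1

def versioned_key(table: str, sql: str, params: dict) -> str:
    payload = json.dumps({"sql": sql, "params": params}, sort_keys=True)
    digest = hashlib.md5(payload.encode()).hexdigest()
    return f"qcache:{table}:v{table_version(table)}:{digest}"

k1 = versioned_key("users", "SELECT name FROM users WHERE id = :id", {"id": 7})
invalidate_table("users")  # e.g. after an UPDATE to users
k2 = versioned_key("users", "SELECT name FROM users WHERE id = :id", {"id": 7})
print(k1 != k2)  # True: the stale entry can no longer be read
```

The trade-off is coarse granularity: one write to a table evicts every cached query touching it, which is wasteful for hot tables but removes the stale-read window entirely.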

Quiz: test your understanding

Before moving on, answer these in your head (or out loud):

  1. You add a read replica, but users report seeing stale data immediately after updating their profile. What is happening and how do you fix it?
  2. What is the key difference between PgBouncer transaction mode and session mode? Why is transaction mode preferred?
  3. Your app uses prepared statements via SQLAlchemy. You want to add PgBouncer in transaction mode. What breaks and how do you fix it?
  4. Using the Hikari formula, calculate the optimal pool size for a 16-core Postgres server with NVMe SSDs. What would you adjust if the workload is read-heavy with long-running analytics queries?
  5. When does adding more replicas stop helping with read scaling? What other bottlenecks emerge?

Next up — Part 05: Scaling Writes. Batch inserts, UPSERT patterns, write buffering, and when partitioning actually helps writes.
