Staff Prep 04: Scaling Postgres Reads — Replicas, Pooling & Caching

April 4, 2026 · 9 min read · PART 06 / 19

Back to Part 03: Indexing Strategy.

Your Postgres primary is at 80% CPU. You have good indexes. Queries are optimised. The problem is pure volume: too many reads. The solution stack is read replicas for query offload, connection pooling to prevent connection exhaustion, and application-level caching for the hottest paths. Let's build through each layer.

Read replicas: the basics and the gotchas

A read replica is a streaming copy of the primary. Postgres ships WAL (Write-Ahead Log) records to replicas in near-real-time. Replicas apply these records and stay within milliseconds of the primary under normal conditions.

What you can route to a replica:

  • Reporting queries, analytics, aggregations
  • Background job reads that can tolerate slight staleness
  • Paginated list queries for admin dashboards
  • Full-text search queries

What you cannot route to a replica:

  • Anything that must read data immediately after writing it (read-your-writes consistency)
  • Financial calculations requiring exact current state
  • Anything that informs a write decision on the primary

```python
from sqlalchemy.ext.asyncio import create_async_engine, AsyncSession
from sqlalchemy.orm import sessionmaker

# Two engine pools: primary for writes, replica for reads
primary_engine = create_async_engine(
    "postgresql+asyncpg://user:pass@primary-db:5432/mydb",
    pool_size=20,
    max_overflow=10,
)

replica_engine = create_async_engine(
    "postgresql+asyncpg://user:pass@replica-db:5432/mydb",
    pool_size=30,    # replicas can handle more connections
    max_overflow=20,
)

PrimarySession = sessionmaker(primary_engine, class_=AsyncSession, expire_on_commit=False)
ReplicaSession = sessionmaker(replica_engine, class_=AsyncSession, expire_on_commit=False)

# FastAPI dependency injection
async def get_db_write():
    async with PrimarySession() as session:
        yield session

async def get_db_read():
    async with ReplicaSession() as session:
        yield session
```
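For the read-your-writes cases above, one common pattern is to pin a user's reads to the primary for a short window after any write. A minimal in-process sketch, assuming a `StickyRouter` helper and a 5-second pin window (both are illustrative, not part of the series' code):

```python
import time

class StickyRouter:
    """Route a user's reads to the primary for a short window after they
    write, so they never see a replica that hasn't replayed their change."""

    def __init__(self, pin_seconds: float = 5.0):
        self.pin_seconds = pin_seconds
        self._last_write: dict[str, float] = {}

    def record_write(self, user_id: str) -> None:
        # Call this from the write path, after the commit succeeds.
        self._last_write[user_id] = time.monotonic()

    def target_for_read(self, user_id: str) -> str:
        pinned_at = self._last_write.get(user_id)
        if pinned_at is not None and time.monotonic() - pinned_at < self.pin_seconds:
            return "primary"
        return "replica"

router = StickyRouter(pin_seconds=5.0)
router.record_write("user-42")
print(router.target_for_read("user-42"))  # pinned: just wrote
print(router.target_for_read("user-99"))  # never wrote: replica is fine
```

In a real deployment the pin map would live in a shared store such as Redis, since requests from the same user land on different app servers.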

Replication lag: measuring and handling it

```sql
-- On the primary: check replication lag per replica
SELECT
    client_addr,
    state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    (sent_lsn - replay_lsn) AS replication_lag_bytes,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication;

-- On the replica: check how far behind it is
SELECT
    NOW() - pg_last_xact_replay_timestamp() AS replication_delay,
    pg_is_in_recovery() AS is_replica;
```

If replication_delay is consistently above 1–2 seconds, the replica is falling behind. Common causes: a CPU-bound replica, network latency, or a very high WAL generation rate on the primary. One caveat: pg_last_xact_replay_timestamp() only advances when transactions are replayed, so the measured delay also grows on an idle primary with no real lag. At sustained high lag, consider temporarily routing more critical reads back to the primary.

Connection pooling: why it matters

Postgres allocates a separate OS process per connection. Each process uses ~5–10 MB of RAM. At 500 connections: 2.5–5 GB RAM just for connection processes, before your data even gets cached. More critically, Postgres throughput does not scale linearly with connections — past ~100 active connections, throughput can actually decrease due to lock contention on internal structures.

The solution: a connection pool that multiplexes many application connections onto a smaller number of Postgres server connections. PgBouncer is the standard.

PgBouncer modes: the one decision that matters most

PgBouncer has three pooling modes. Choosing wrong is a production incident waiting to happen.

Session mode: One server connection per client connection, held for the entire client session. Essentially no benefit over no pooler — same connection count, just adds a hop. Use only if you need PgBouncer for other reasons (auth, load distribution).

Transaction mode: A server connection is assigned when a transaction starts and released when it commits or rolls back. The right choice for most applications. 100 app connections can share 20 server connections. Major caveat: session state does not survive across transactions, so prepared statements, session-level advisory locks, LISTEN/NOTIFY, and session-level SET are broken in transaction mode (SET LOCAL is transaction-scoped and therefore safe). Use bind parameters instead of prepared statements, or rely on PgBouncer 1.21+, which can track protocol-level prepared statements via max_prepared_statements.

Statement mode: A server connection is released after every single statement. Maximum multiplexing. Problem: multi-statement transactions are impossible. Rarely usable in practice.

```ini
; pgbouncer.ini — transaction mode configuration
[databases]
mydb = host=primary-db port=5432 dbname=mydb

[pgbouncer]
pool_mode = transaction          ; transaction mode for best multiplexing
max_client_conn = 1000           ; app can open up to 1000 connections to pgbouncer
default_pool_size = 25           ; only 25 actual Postgres server connections
min_pool_size = 5
reserve_pool_size = 5
reserve_pool_timeout = 3

; Authentication
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt

; Keepalive
server_idle_timeout = 600
client_idle_timeout = 0
```

Connection pool sizing math

The classic formula from the HikariCP team, widely cited in the Postgres community:

pool_size = (num_cores * 2) + effective_spindle_count

For a 4-core Postgres server with SSDs (effective_spindle_count = 1): pool_size = (4 * 2) + 1 = 9. That seems low, but it is often right for CPU-bound workloads. For I/O-bound workloads with slow disks, you can go higher. Most production Postgres deployments perform optimally at 20–50 server connections, not 200.
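As a sanity check, the formula in code, using the core and spindle counts from the examples in this section:

```python
def pool_size(num_cores: int, effective_spindle_count: int) -> int:
    # pool_size = (num_cores * 2) + effective_spindle_count
    return num_cores * 2 + effective_spindle_count

print(pool_size(4, 1))   # the 4-core SSD example above: 9
print(pool_size(16, 1))  # the 16-core NVMe case from the quiz: 33
```

Treat the result as a starting point, not a law: raise it for I/O-bound workloads and validate with load tests either way.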

```python
from sqlalchemy.ext.asyncio import create_async_engine

# Connect through PgBouncer in transaction mode.
# The prepared-statement cache is disabled via the URL parameter
# prepared_statement_cache_size=0 (a SQLAlchemy asyncpg-dialect option);
# cached prepared statements break under transaction pooling.
engine = create_async_engine(
    "postgresql+asyncpg://user:pass@pgbouncer:6432/mydb"
    "?prepared_statement_cache_size=0",
    pool_size=10,         # SQLAlchemy pool (connections to PgBouncer)
    max_overflow=20,
    pool_pre_ping=True,   # verify connections before use
    pool_recycle=3600,    # recycle connections hourly (avoid stale ones)
)
```

Query result caching: the last mile

Even with replicas and PgBouncer, some queries are too frequent or too expensive to hit the database every time. Application-level caching (Redis) is the final layer.

```python
import hashlib
import json

import redis.asyncio as redis
from sqlalchemy import text

r = redis.Redis(host="redis", port=6379, decode_responses=True)

def cache_key(sql: str, params: dict) -> str:
    payload = json.dumps({"sql": sql, "params": params}, sort_keys=True)
    return "qcache:" + hashlib.md5(payload.encode()).hexdigest()

async def cached_query(session, sql: str, params: dict, ttl: int = 60):
    key = cache_key(sql, params)

    cached = await r.get(key)
    if cached:
        return json.loads(cached)

    # SQLAlchemy 2.x requires string SQL to be wrapped in text();
    # .mappings() yields dict-like rows that serialize cleanly.
    result = await session.execute(text(sql), params)
    rows = [dict(row) for row in result.mappings().all()]

    await r.setex(key, ttl, json.dumps(rows))
    return rows
```
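The missing piece is invalidation: with per-key TTLs, a write can serve stale data for up to `ttl` seconds. One low-effort alternative is namespace versioning: embed a per-table version number in every cache key and bump it on writes, so every key for that table misses at once and the old entries simply age out. A sketch with a plain dict standing in for Redis (the `version_store`, `versioned_key`, and `invalidate_table` names are illustrative; in production the version would be an INCR-ed Redis key):

```python
import hashlib
import json

# Stand-in for Redis; in production these would be GET/INCR on Redis keys.
version_store: dict[str, int] = {}

def table_version(table: str) -> int:
    return version_store.get(table, 0)

def invalidate_table(table: str) -> None:
    """Bump the version after any write: old keys for this table become
    unreachable immediately and expire via their TTLs."""
    version_store[table] = table_version(table) + 1

def versioned_key(table: str, sql: str, params: dict) -> str:
    payload = json.dumps({"sql": sql, "params": params}, sort_keys=True)
    digest = hashlib.md5(payload.encode()).hexdigest()
    return f"qcache:{table}:v{table_version(table)}:{digest}"

k1 = versioned_key("users", "SELECT name FROM users WHERE id = :id", {"id": 7})
invalidate_table("users")  # e.g. after an UPDATE to users
k2 = versioned_key("users", "SELECT name FROM users WHERE id = :id", {"id": 7})
print(k1 != k2)  # True: the stale entry can no longer be read
```

The trade-off is coarse granularity: one write to a table evicts every cached query touching it, which is wasteful for hot tables but removes the stale-read window entirely.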

Quiz: test your understanding

Before moving on, answer these in your head (or out loud):

  1. You add a read replica, but users report seeing stale data immediately after updating their profile. What is happening and how do you fix it?
  2. What is the key difference between PgBouncer transaction mode and session mode? Why is transaction mode preferred?
  3. Your app uses prepared statements via SQLAlchemy. You want to add PgBouncer in transaction mode. What breaks and how do you fix it?
  4. Using the Hikari formula, calculate the optimal pool size for a 16-core Postgres server with NVMe SSDs. What would you adjust if the workload is read-heavy with long-running analytics queries?
  5. When does adding more replicas stop helping with read scaling? What other bottlenecks emerge?

Next up — Part 05: Scaling Writes. Batch inserts, UPSERT patterns, write buffering, and when partitioning actually helps writes.
