Staff Prep 04: Scaling Postgres Reads — Replicas, Pooling & Caching
Back to Part 03: Indexing Strategy. Your Postgres primary is at 80% CPU. You have good indexes. Queries are optimised. The problem is pure volume — too many reads. The solution stack is: read replicas for query offload, connection pooling to prevent connection exhaustion, and application-level caching for the hottest paths. Let's work through each layer.
Read replicas: the basics and the gotchas
A read replica is a streaming copy of the primary. Postgres ships WAL (Write-Ahead Log) records to replicas in near-real-time. Replicas apply these records and stay within milliseconds of the primary under normal conditions.
What you can route to a replica:
- Reporting queries, analytics, aggregations
- Background job reads that can tolerate slight staleness
- Paginated list queries for admin dashboards
- Full-text search queries
What you cannot route to a replica:
- Anything that must read data immediately after writing it (read-your-writes consistency)
- Financial calculations requiring exact current state
- Anything that informs a write decision on the primary
from sqlalchemy.ext.asyncio import create_async_engine, AsyncSession
from sqlalchemy.orm import sessionmaker

# Two engine pools: primary for writes, replica for reads
primary_engine = create_async_engine(
    "postgresql+asyncpg://user:pass@primary-db:5432/mydb",
    pool_size=20,
    max_overflow=10,
)
replica_engine = create_async_engine(
    "postgresql+asyncpg://user:pass@replica-db:5432/mydb",
    pool_size=30,  # replicas can handle more connections
    max_overflow=20,
)

PrimarySession = sessionmaker(primary_engine, class_=AsyncSession, expire_on_commit=False)
ReplicaSession = sessionmaker(replica_engine, class_=AsyncSession, expire_on_commit=False)

# FastAPI dependency injection
async def get_db_write():
    async with PrimarySession() as session:
        yield session

async def get_db_read():
    async with ReplicaSession() as session:
        yield session
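For the read-your-writes cases above, one common pattern is to pin a user's reads to the primary for a short window after any write they make. A minimal in-process sketch (the helper names and window length are illustrative; in production you would keep the pin in Redis or a cookie so it survives across app instances):

```python
import time

# Hypothetical read-your-writes pin: after a user writes, route that
# user's reads to the primary until replication has caught up.
PIN_WINDOW_SECONDS = 5.0  # assumption: comfortably above typical replication lag
_last_write_at: dict = {}  # user_id -> monotonic timestamp of last write

def record_write(user_id) -> None:
    """Call after committing a write on behalf of this user."""
    _last_write_at[user_id] = time.monotonic()

def use_primary_for_read(user_id) -> bool:
    """True while this user's reads must go to the primary, not a replica."""
    last = _last_write_at.get(user_id)
    return last is not None and (time.monotonic() - last) < PIN_WINDOW_SECONDS
```

In the FastAPI dependencies above, `get_db_read` would check `use_primary_for_read(user_id)` and fall back to `PrimarySession` while the pin is active.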
Replication lag: measuring and handling it
-- On the primary: check replication lag per replica
SELECT
    client_addr,
    state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replication_lag_bytes,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication;

-- On the replica: check how far behind it is
SELECT
    NOW() - pg_last_xact_replay_timestamp() AS replication_delay,
    pg_is_in_recovery() AS is_replica;
If replication_delay is consistently above 1–2 seconds, the replica is falling behind. Common causes: a CPU-bound replica, network latency between primary and replica, or a very high WAL generation rate on the primary. At high lag, consider temporarily routing the more critical reads back to the primary.
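That fallback can be made mechanical: sample the replica's lag periodically, cache the measurement app-side, and consult it before routing each read. A sketch of the routing predicate (the query constant and threshold are assumptions to tune per workload):

```python
from typing import Optional

# Run periodically against the replica; cache the result app-side.
LAG_QUERY = (
    "SELECT EXTRACT(EPOCH FROM NOW() - pg_last_xact_replay_timestamp())"
    " AS lag_seconds"
)

MAX_ACCEPTABLE_LAG_SECONDS = 2.0  # assumption: tune per workload

def should_use_replica(lag_seconds: Optional[float],
                       max_lag: float = MAX_ACCEPTABLE_LAG_SECONDS) -> bool:
    """Route a read to the replica only while measured lag is acceptable.

    lag_seconds is None when no measurement is available yet (or the
    replica has replayed nothing); read from the primary in that case.
    """
    return lag_seconds is not None and lag_seconds <= max_lag
```

Keeping the measurement cached (rather than checking per query) avoids turning the lag check itself into extra replica load.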
Connection pooling: why it matters
Postgres allocates a separate OS process per connection. Each process uses ~5–10 MB of RAM. At 500 connections: 2.5–5 GB RAM just for connection processes, before your data even gets cached. More critically, Postgres throughput does not scale linearly with connections — past ~100 active connections, throughput can actually decrease due to lock contention on internal structures.
The solution: a connection pool that multiplexes many application connections onto a smaller number of Postgres server connections. PgBouncer is the standard.
PgBouncer modes: the one decision that matters most
PgBouncer has three pooling modes. Choosing wrong is a production incident waiting to happen.
Session mode: One server connection per client connection, held for the entire client session. Essentially no benefit over no pooler — same connection count, just adds a hop. Use only if you need PgBouncer for other reasons (auth, load distribution).
Transaction mode: A server connection is assigned when a transaction starts and released when it commits or rolls back. The right choice for most applications: 100 app connections can share 20 server connections. Major caveat: session-scoped state breaks, because consecutive transactions from the same client may land on different server connections. Named prepared statements, session-level advisory locks, and plain SET do not carry over (SET LOCAL is transaction-scoped and therefore safe). Disable your driver's prepared-statement cache when running in this mode.
Statement mode: A server connection is released after every single statement. Maximum multiplexing. Problem: multi-statement transactions are impossible. Rarely usable in practice.
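To make the transaction-mode mechanic concrete, here is a toy simulation (not PgBouncer itself, just the checkout/return behaviour) showing how a server connection is grabbed at transaction start, returned at commit, and then handed to a different client — exactly why session state does not survive:

```python
from collections import deque

class ToyTransactionPool:
    """Toy model of PgBouncer transaction mode: a server connection is
    held only for the duration of one transaction, then recycled."""

    def __init__(self, server_connections: int):
        self.free = deque(range(server_connections))
        self.assigned = {}  # client name -> server connection id

    def begin(self, client: str) -> int:
        if not self.free:
            raise RuntimeError("pool exhausted; client must wait")
        conn = self.free.popleft()
        self.assigned[client] = conn
        return conn

    def commit(self, client: str) -> None:
        # The connection returns to the pool and may serve another client,
        # which is why prepared statements and advisory locks set up by
        # the previous client cannot be relied on afterwards.
        self.free.append(self.assigned.pop(client))

pool = ToyTransactionPool(server_connections=2)
a = pool.begin("client-a")
b = pool.begin("client-b")
pool.commit("client-a")
c = pool.begin("client-c")  # reuses the connection client-a just released
```

With only 2 server connections, client-c transparently inherits the connection client-a released; the same mechanic is how 100 app connections share 20 server connections, provided transactions stay short.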
; pgbouncer.ini — transaction mode configuration
[databases]
mydb = host=primary-db port=5432 dbname=mydb
[pgbouncer]
pool_mode = transaction ; transaction mode for best multiplexing
max_client_conn = 1000 ; app can open up to 1000 connections to pgbouncer
default_pool_size = 25 ; only 25 actual Postgres server connections
min_pool_size = 5
reserve_pool_size = 5
reserve_pool_timeout = 3
; Authentication
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
; Keepalive
server_idle_timeout = 600
client_idle_timeout = 0
Connection pool sizing math
The classic formula, popularised by the HikariCP team and echoed on the PostgreSQL wiki:

pool_size = (num_cores * 2) + effective_spindle_count

For a 4-core Postgres server with SSDs (effective_spindle_count = 1): pool_size = (4 * 2) + 1 = 9. That seems low, but it is often right for CPU-bound workloads. For I/O-bound workloads on slow disks you can go higher. Most production Postgres deployments perform best at 20–50 server connections, not 200.
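The heuristic is simple enough to encode directly; this throwaway helper just reproduces the arithmetic above:

```python
def recommended_pool_size(num_cores: int, effective_spindle_count: int = 1) -> int:
    """HikariCP-style sizing heuristic for Postgres server connections."""
    return num_cores * 2 + effective_spindle_count

print(recommended_pool_size(4))   # 4-core + SSD example above: 9
print(recommended_pool_size(16))  # 16-core server: 33
```

Treat the result as a starting point, not a law: benchmark with your own workload before committing to a number.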
from sqlalchemy.ext.asyncio import create_async_engine

# Connect through PgBouncer in transaction mode.
# prepared_statement_cache_size=0 tells SQLAlchemy's asyncpg dialect not to
# cache prepared statements (incompatible with PgBouncer transaction mode).
engine = create_async_engine(
    "postgresql+asyncpg://user:pass@pgbouncer:6432/mydb?prepared_statement_cache_size=0",
    pool_size=10,        # SQLAlchemy pool (connections to PgBouncer)
    max_overflow=20,
    pool_pre_ping=True,  # verify connections before use
    pool_recycle=3600,   # recycle connections hourly (avoid stale ones)
)
Query result caching: the last mile
Even with replicas and PgBouncer, some queries are too frequent or too expensive to hit the database every time. Application-level caching (Redis) is the final layer.
import json
import hashlib

import redis.asyncio as redis
from sqlalchemy import text

r = redis.Redis(host="redis", port=6379, decode_responses=True)

def cache_key(sql: str, params: dict) -> str:
    payload = json.dumps({"sql": sql, "params": params}, sort_keys=True)
    return "qcache:" + hashlib.md5(payload.encode()).hexdigest()

async def cached_query(session, sql: str, params: dict, ttl: int = 60):
    key = cache_key(sql, params)
    cached = await r.get(key)
    if cached:
        return json.loads(cached)
    result = await session.execute(text(sql), params)
    rows = [dict(row) for row in result.mappings().all()]
    # default=str so dates/Decimals serialise; adjust for your types
    await r.setex(key, ttl, json.dumps(rows, default=str))
    return rows
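Cached reads go stale the moment the underlying rows change, so you also need an invalidation path. One simple strategy is to tag each cache entry with the tables it reads and drop every entry for a table whenever that table is written. An in-memory sketch of the idea (swap the dicts for Redis SETEX/SADD/SMEMBERS/DEL in production):

```python
class TaggedCache:
    """In-memory sketch of table-tagged cache invalidation."""

    def __init__(self):
        self.entries = {}     # cache key -> cached rows
        self.table_tags = {}  # table name -> set of cache keys reading it

    def put(self, key, rows, tables):
        self.entries[key] = rows
        for table in tables:
            self.table_tags.setdefault(table, set()).add(key)

    def get(self, key):
        return self.entries.get(key)

    def invalidate_table(self, table):
        """Call after any write to `table`."""
        for key in self.table_tags.pop(table, set()):
            self.entries.pop(key, None)

cache = TaggedCache()
cache.put("qcache:abc", [{"id": 1}], tables=["users"])
cache.invalidate_table("users")  # e.g. after UPDATE users ...
```

TTL-only expiry (as in `cached_query` above) is often good enough for dashboards; explicit invalidation matters when a 60-second-stale answer is visibly wrong to users.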
Quiz: test your understanding
Before moving on, answer these in your head (or out loud):
- You add a read replica, but users report seeing stale data immediately after updating their profile. What is happening and how do you fix it?
- What is the key difference between PgBouncer transaction mode and session mode? Why is transaction mode preferred?
- Your app uses prepared statements via SQLAlchemy. You want to add PgBouncer in transaction mode. What breaks and how do you fix it?
- Using the Hikari formula, calculate the optimal pool size for a 16-core Postgres server with NVMe SSDs. What would you adjust if the workload is read-heavy with long-running analytics queries?
- When does adding more replicas stop helping with read scaling? What other bottlenecks emerge?
Next up — Part 05: Scaling Writes. Batch inserts, UPSERT patterns, write buffering, and when partitioning actually helps writes.