Postgres Internals 105: Scaling Reads
A single Postgres instance handles a few hundred connections before memory pressure kicks in. It handles reads fine until reporting queries start competing with user-facing transactions. It handles your current load until it doesn't. This is Part 5 of the Postgres Internals series — how to scale reads with replicas, how to stop killing your database with connection counts, and how to use caching without introducing bugs that are worse than the slowness you fixed.
The problem: one instance can't do everything
A default Postgres installation allows 100 connections (max_connections = 100). Each connection consumes roughly 5–10 MB of RAM for its backend process. At 500 connections you're using 2.5–5 GB of RAM just for connection overhead — before a single query runs. And that's before the reads start competing with writes.
The three tools that solve this operate at different layers:
- Read replicas — offload SELECT queries to copies of your database, freeing the primary for writes
- Connection pooling — share a small pool of real Postgres connections across many app connections
- Caching — avoid hitting the database at all for data that doesn't change often
Each layer solves a different bottleneck. Applying the wrong one to the wrong problem costs you time and introduces new failure modes.
Read replicas: streaming replication
Postgres streaming replication works by sending WAL (Write-Ahead Log) records from the primary to one or more standby servers in near real-time. The standby replays those records to keep its data in sync.
                ┌──────────────┐
Writes ────────►│   Primary    │
                │  (Postgres)  │
                └──────┬───────┘
                       │ WAL stream (async by default)
          ┌────────────┼────────────┐
          ▼            ▼            ▼
    ┌─────────┐  ┌─────────┐  ┌───────────┐
    │Replica 1│  │Replica 2│  │ Replica 3 │
    │ (reads) │  │ (reads) │  │(analytics)│
    └─────────┘  └─────────┘  └───────────┘
Replica lag: usually <100ms, can spike during heavy writes
Replicas are read-only — writes fail with an error
Your application needs to route traffic explicitly. Writes go to the primary connection string, reads go to a replica. Most ORMs support this natively (Prisma, Sequelize, SQLAlchemy all have read/write splitting).
// Node.js example — explicit routing
import { Pool } from 'pg';

const primary = new Pool({ connectionString: process.env.DATABASE_PRIMARY_URL });
const replica = new Pool({ connectionString: process.env.DATABASE_REPLICA_URL });

// Writes → primary
async function createUser(data: UserInput) {
  return primary.query('INSERT INTO users ...', [data]);
}

// Reads → replica
async function getUser(id: string) {
  return replica.query('SELECT * FROM users WHERE id = $1', [id]);
}
Replication lag: the silent bug
Replicas are eventually consistent. By default, Postgres replicates asynchronously — the primary commits and confirms to the client before the replica has applied the change. Lag is usually under 100ms. During a large write batch, it can spike to seconds or minutes.
The most common bug this causes: read-your-own-writes failures. User creates a post, you redirect them to the post page, the page reads from a replica that hasn't caught up yet — "Post not found."
// The bug
async function createPostAndRedirect(data: PostInput) {
  await primary.query('INSERT INTO posts ...', [data]);
  redirect(`/posts/${data.slug}`); // page reads from replica → 404
}

// The fix: read-after-write goes to primary
async function getPostAfterCreate(slug: string) {
  // Immediately after a write, read from primary
  return primary.query('SELECT * FROM posts WHERE slug = $1', [slug]);
}

// Better: sticky routing for the creating user's session for ~2 seconds
// OR: synchronous replication for critical write paths (performance cost)
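For the synchronous option, the relevant settings live in postgresql.conf on the primary. A minimal sketch — the standby name 'replica1' is an assumption and must match the standby's application_name:

# postgresql.conf (primary) — 'replica1' is an assumed application_name
synchronous_standby_names = 'replica1'

# remote_apply makes COMMIT wait until the standby has *applied* the change,
# which is what read-your-own-writes needs. Plain 'on' only waits for the
# WAL to be flushed on the standby — a read there can still miss the row.
synchronous_commit = remote_apply

Note this applies the latency cost to every commit; many teams instead set synchronous_commit per-transaction only on critical write paths.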
Connection pooling: PgBouncer
Every connection to Postgres spawns a backend process. That process holds memory whether it's running a query or idle. A Rails app with 20 Puma workers × 5 threads = 100 connections before you've deployed a second instance. PgBouncer sits between your app and Postgres, maintaining a small pool of real connections and queuing requests.
Without pooling:
──────────────────────────────────────────────────
App (500 threads) → 500 Postgres backend processes
Each: ~10 MB RAM → 5 GB RAM just for connections

With PgBouncer:
──────────────────────────────────────────────────
App (500 threads) → PgBouncer → 20 Postgres connections
500 app connections share 20 real connections
RAM for connections: ~200 MB
PgBouncer modes — this choice matters
PgBouncer has three pooling modes. Most engineers pick the wrong one and wonder why their transactions are broken.
| Mode | Connection returned after | Safe for | Breaks |
|---|---|---|---|
| Session | Client disconnects | Everything | Nothing — but barely helps |
| Transaction | Transaction ends (COMMIT/ROLLBACK) | Most apps ✅ | Prepared statements, SET commands, advisory locks |
| Statement | Each statement | Almost nothing | Multi-statement transactions, sessions |
Use transaction mode. It's the right default for almost every web application. If your ORM uses prepared statements (Prisma does), disable them at the driver level when using PgBouncer transaction mode — they don't work across connection hand-offs.
# pgbouncer.ini
[databases]
mydb = host=postgres-primary port=5432 dbname=mydb
[pgbouncer]
pool_mode = transaction
max_client_conn = 1000 ; app-facing connections PgBouncer accepts
default_pool_size = 25 ; real Postgres connections per database/user pair
min_pool_size = 5
reserve_pool_size = 5 ; emergency connections for traffic spikes
server_idle_timeout = 600
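On the application side, Prisma (mentioned above) documents a pgbouncer=true connection-string flag that disables the prepared-statement behavior that breaks in transaction mode. The host, port, and credentials below are placeholders:

# .env — placeholders throughout; 6432 is PgBouncer's conventional port
DATABASE_URL="postgresql://app_user:secret@pgbouncer-host:6432/mydb?pgbouncer=true"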
Caching with Redis
Replicas and pooling make your database faster under load. Caching lets you skip it entirely for data that hasn't changed. The tradeoff is staleness — and getting cache invalidation wrong is how you serve stale prices at checkout, show deleted content, or leak one user's data to another.
Cache-aside pattern
// Cache-aside: the application manages the cache explicitly
async function getUserProfile(userId: string): Promise<User> {
const cacheKey = `user:${userId}`;
// 1. Check cache first
const cached = await redis.get(cacheKey);
if (cached) return JSON.parse(cached);
// 2. Cache miss → hit database
const user = await db.query('SELECT * FROM users WHERE id = $1', [userId]);
// 3. Write to cache with TTL
await redis.setex(cacheKey, 300, JSON.stringify(user)); // 5 min TTL
return user;
}
// On update: invalidate the cache
async function updateUser(userId: string, data: Partial<User>) {
await db.query('UPDATE users SET ... WHERE id = $1', [userId]);
await redis.del(`user:${userId}`); // invalidate
}
What to cache (and what not to)
Good candidates for caching: user profiles, product listings, configuration data, aggregated counts, anything that changes rarely and is read often.
Bad candidates: paginated lists (one insert invalidates every page), anything requiring strong consistency (account balances, inventory counts), data that changes on every request.
// ❌ Don't cache paginated lists — invalidation is a nightmare
const cacheKey = `products:page:${page}:limit:${limit}`;
// Any new product means all pages are stale → cache thrash
// ✅ Cache the individual items, not the list
const cacheKey = `product:${productId}`;
// List queries hit the DB, but individual product fetches are cached
Always set a TTL
A cache key with no TTL lives forever. Your Redis grows unboundedly. Stale data accumulates. Users see a deleted product. A user who changed their name 6 months ago still sees the old one. Always set a TTL, even if it's 24 hours. The right TTL is the longest your business can tolerate stale data for that entity.
Putting it together
PRODUCTION READ SCALING STACK
──────────────────────────────────────────────────────────────
                 ┌──────────────┐
Writes ─────────►│   Primary    │◄── PgBouncer (transaction mode)
                 └──────┬───────┘           ▲
                        │ WAL               │ write path
           ┌────────────┴───────┐           │
           ▼                    ▼           │
     ┌──────────┐         ┌──────────┐      │
     │Replica 1 │         │Replica 2 │      │
     └─────┬────┘         └─────┬────┘      │
           │                    │           │
           └────────┬───────────┘           │
                    ▼                       │
            PgBouncer (reads)               │
                    │                       │
            ┌───────┴────────┐              │
            ▼                ▼              │
     App Server 1     App Server 2 ─────────┘
           │
           ▼
         Redis ←── cache-aside, TTL on everything
Quiz: scaling reads
Q1. Your app creates a user, then immediately redirects to their profile page. The profile shows "User not found." The INSERT succeeded. What's happening and how do you fix it?
Replication lag. You wrote to the primary, then redirected. The profile page read from a replica that hadn't replicated the new row yet — the replica was behind by even a few milliseconds.
Fix options:
- Read immediately-after-write from the primary instead of the replica
- Implement session stickiness: for 2–3 seconds after a write, route that user's reads to the primary
- Use synchronous replication (synchronous_commit = remote_apply, which makes COMMIT wait until the standby has applied the change) for critical write paths — but this adds latency
// Simple fix: read-after-write uses primary connection
const user = await primary.query('SELECT * FROM users WHERE id = $1', [newUserId]);
Q2. You configure PgBouncer in statement mode to maximise connection efficiency. Your app uses SQLAlchemy with session.begin() for transactions. What breaks and why?
Multi-statement transactions break entirely. In statement mode, PgBouncer returns the connection to the pool after every single SQL statement. BEGIN runs on connection A. The next statement runs on connection B. The COMMIT runs on connection C. There is no coherent transaction — the database never sees a complete BEGIN...COMMIT on the same connection.
Statement mode also breaks SET commands (session settings reset), prepared statements, advisory locks, and LISTEN/NOTIFY.
Fix: Use transaction mode. It supports BEGIN...COMMIT correctly because the connection is held for the duration of one transaction, then returned.
Q3. Your cache-aside stores user profiles with no TTL. After 6 months, Redis is 40 GB and growing. Only 200k of 2M users are monthly-active. What's wrong?
No TTL means keys never expire. Every user profile ever cached stays in Redis forever. You're caching 2M users but only 200k are active — 1.8M stale cache entries that nobody reads are consuming most of your 40 GB.
Fix: Always set a TTL. For user profiles, something like 30–60 minutes is usually right. Redis will automatically evict the key when it expires, and the next read for that user repopulates it from the database.
// Always include a TTL
await redis.setex(`user:${userId}`, 3600, JSON.stringify(user)); // 1 hour
You can also configure Redis with maxmemory-policy allkeys-lru as a safety net — Redis evicts the least-recently-used keys when memory is full. But TTLs are the real fix.
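That safety net is two lines of redis.conf — the memory limit here is an example; size it to your workload:

# redis.conf
maxmemory 2gb
maxmemory-policy allkeys-lru   # evict least-recently-used keys at the limit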
Q4. Spot the bug: your app routes reads to replicas when the query starts with SELECT. Under what condition does this fail dangerously?
def get_connection(query):
    if query.strip().upper().startswith("SELECT"):
        return replica_pool.get()
    return primary_pool.get()
It fails for read-your-own-writes after any write. After an INSERT or UPDATE, if the very next query is a SELECT (checking the result of the write), this logic routes it to the replica — which may not have replicated the change yet. The SELECT returns stale data or no data.
It also fails for SELECT ... FOR UPDATE — that's a locking read that must run on the primary, but this code sends it to a replica where it either fails or silently acquires no lock.
Better approach: Track write state per request/transaction and force primary reads for a configurable window after any write in that session.
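One way to sketch that write tracking — the in-memory session map, pool names, and 2-second window here are illustrative assumptions, not a fixed recipe (in a multi-instance deployment the timestamp would live in a shared store or a cookie):

```typescript
// Sticky read-after-write routing: after any write in a session, route that
// session's reads to the primary for a short window, then fall back to replicas.
type Target = 'primary' | 'replica';

const STICKY_WINDOW_MS = 2000; // how long reads stay pinned to the primary
const lastWriteAt = new Map<string, number>(); // sessionId → time of last write

// Call this whenever the session performs an INSERT/UPDATE/DELETE
function recordWrite(sessionId: string): void {
  lastWriteAt.set(sessionId, Date.now());
}

// Call this to decide where a read should go
function routeRead(sessionId: string): Target {
  const t = lastWriteAt.get(sessionId);
  // No recent write in this session → safe to read from a replica
  if (t === undefined || Date.now() - t > STICKY_WINDOW_MS) return 'replica';
  return 'primary';
}
```

The map also needs periodic cleanup of expired entries to avoid unbounded growth — omitted here for brevity.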
Q5. A batch import inserts 10M rows over 30 minutes. Replica lag spikes to 45 seconds. Users see missing data on dashboards reading from replicas. What's the root cause and two ways to mitigate it?
Root cause: Async replication can't keep up with the write throughput of the batch job. The replica is applying WAL records as fast as it can, but the primary is generating them faster — lag accumulates. Any replica reads during this window return data that's up to 45 seconds stale.
Two mitigations:
- Route dashboards to primary during known batch windows. If the import runs nightly 2–4 AM, your routing layer can temporarily direct all reads to the primary during that window. Adds load to primary but eliminates staleness.
- Throttle the batch import to limit WAL generation rate. Add pg_sleep(0.01) between batches, or reduce batch size. Slows the import but keeps replica lag low — users see consistent data throughout.
A third option: expose replica lag as a metric, and have the application fall back to the primary automatically if lag exceeds a threshold (e.g., >5 seconds).
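That fallback can be sketched as follows. The 5-second threshold and the minimal Queryable interface are illustrative assumptions; pg_last_xact_replay_timestamp() is a standard Postgres function that reports the commit time of the last WAL record replayed on a standby:

```typescript
// Minimal shape of what we need from a connection pool (node-postgres
// compatible) — using an interface keeps the sketch dependency-free.
interface Queryable {
  query(sql: string): Promise<{ rows: Array<Record<string, unknown>> }>;
}

const LAG_THRESHOLD_SECONDS = 5; // illustrative; tune to your tolerance

// On a standby, now() - pg_last_xact_replay_timestamp() approximates lag.
async function replicaLagSeconds(replica: Queryable): Promise<number> {
  const res = await replica.query(
    "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag"
  );
  return Number(res.rows[0]?.lag ?? 0);
}

// Route a read to the replica unless it has fallen too far behind.
async function pickReadPool(
  primary: Queryable,
  replica: Queryable
): Promise<Queryable> {
  const lag = await replicaLagSeconds(replica);
  return lag > LAG_THRESHOLD_SECONDS ? primary : replica;
}
```

In production you would cache the lag value for a second or two rather than query it on every read, and treat a failed lag check as "use the primary".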
Part 5 done. Next up — Part 6: Scaling Writes. Partitioning, batch inserts, UPSERT patterns, and how to stop a single Postgres primary from becoming your write bottleneck.