Postgres Internals 105: Scaling Reads
A single Postgres instance handles a few hundred connections before memory pressure kicks in. It handles reads fine until reporting queries start competing with user-facing transactions. It handles your current load until it doesn't. This is Part 5 of the Postgres Internals series — how to scale reads with replicas, how to stop killing your database with connection counts, and how to use caching without introducing bugs that are worse than the slowness you fixed.
The problem: one instance can't do everything
A default Postgres installation allows 100 connections (max_connections = 100). Each connection consumes roughly 5–10 MB of RAM for its backend process. At 500 connections you're using 2.5–5 GB of RAM just for connection overhead — before a single query runs. And that's before the reads start competing with writes.
The three tools that solve this operate at different layers:
- Read replicas — offload SELECT queries to copies of your database, freeing the primary for writes
- Connection pooling — share a small pool of real Postgres connections across many app connections
- Caching — avoid hitting the database at all for data that doesn't change often
Each layer solves a different bottleneck. Applying the wrong one to the wrong problem costs you time and introduces new failure modes.
Read replicas: streaming replication
Postgres streaming replication works by sending WAL (Write-Ahead Log) records from the primary to one or more standby servers in near real-time. The standby replays those records to keep its data in sync.
                ┌──────────────┐
Writes ────────►│   Primary    │
                │  (Postgres)  │
                └──────┬───────┘
                       │ WAL stream (async by default)
          ┌────────────┼────────────┐
          ▼            ▼            ▼
    ┌─────────┐  ┌─────────┐  ┌───────────┐
    │Replica 1│  │Replica 2│  │ Replica 3 │
    │ (reads) │  │ (reads) │  │(analytics)│
    └─────────┘  └─────────┘  └───────────┘
Replica lag: usually <100ms, can spike during heavy writes
Replicas are read-only — writes fail with an error
Your application needs to route traffic explicitly. Writes go to the primary connection string, reads go to a replica. Most ORMs support this natively (Prisma, Sequelize, SQLAlchemy all have read/write splitting).
// Node.js example — explicit routing
import { Pool } from 'pg';

const primary = new Pool({ connectionString: process.env.DATABASE_PRIMARY_URL });
const replica = new Pool({ connectionString: process.env.DATABASE_REPLICA_URL });

// Writes → primary
async function createUser(data: UserInput) {
  return primary.query('INSERT INTO users ...', [data]);
}

// Reads → replica
async function getUser(id: string) {
  return replica.query('SELECT * FROM users WHERE id = $1', [id]);
}
Replication lag: the silent bug
Replicas are eventually consistent. By default, Postgres replicates asynchronously — the primary commits and confirms to the client before the replica has applied the change. Lag is usually under 100ms. During a large write batch, it can spike to seconds or minutes.
The most common bug this causes: read-your-own-writes failures. User creates a post, you redirect them to the post page, the page reads from a replica that hasn't caught up yet — "Post not found."
// The bug
async function createPostAndRedirect(data: PostInput) {
  await primary.query('INSERT INTO posts ...', [data]);
  redirect(`/posts/${data.slug}`); // page reads from replica → 404
}

// The fix: read-after-write goes to primary
async function getPostAfterCreate(slug: string) {
  // Immediately after a write, read from primary
  return primary.query('SELECT * FROM posts WHERE slug = $1', [slug]);
}

// Better: sticky routing for the creating user's session for ~2 seconds
// OR: synchronous replication for critical write paths (performance cost)
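For the synchronous option, the relevant settings live in postgresql.conf on the primary. A minimal sketch — the standby name 'replica1' is an assumption and must match the standby's application_name:

# postgresql.conf (primary) — 'replica1' is an assumed application_name
synchronous_standby_names = 'replica1'

# remote_apply makes COMMIT wait until the standby has *applied* the change,
# which is what read-your-own-writes needs. Plain 'on' only waits for the
# WAL to be flushed on the standby — a read there can still miss the row.
synchronous_commit = remote_apply

Note this applies the latency cost to every commit; many teams instead set synchronous_commit per-transaction only on critical write paths.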
Connection pooling: PgBouncer
Every connection to Postgres spawns a backend process. That process holds memory whether it's running a query or idle. A Rails app with 20 Puma workers × 5 threads = 100 connections before you've deployed a second instance. PgBouncer sits between your app and Postgres, maintaining a small pool of real connections and queuing requests.
Without pooling:
──────────────────────────────────────────────────
App (500 threads) → 500 Postgres backend processes
Each: ~10 MB RAM → 5 GB RAM just for connections

With PgBouncer:
──────────────────────────────────────────────────
App (500 threads) → PgBouncer → 20 Postgres connections
500 app connections share 20 real connections
RAM for connections: ~200 MB
PgBouncer modes — this choice matters
PgBouncer has three pooling modes. Most engineers pick the wrong one and wonder why their transactions are broken.
| Mode | Connection returned after | Safe for | Breaks |
|---|---|---|---|
| Session | Client disconnects | Everything | Nothing — but barely helps |
| Transaction | Transaction ends (COMMIT/ROLLBACK) | Most apps ✅ | Prepared statements, SET commands, advisory locks |
| Statement | Each statement | Almost nothing | Multi-statement transactions, sessions |
Use transaction mode. It's the right default for almost every web application. If your ORM uses prepared statements (Prisma does), disable them at the driver level when using PgBouncer transaction mode — they don't work across connection hand-offs.
# pgbouncer.ini
[databases]
mydb = host=postgres-primary port=5432 dbname=mydb
[pgbouncer]
pool_mode = transaction
max_client_conn = 1000 ; app-facing connections PgBouncer accepts
default_pool_size = 25 ; real Postgres connections per database/user pair
min_pool_size = 5
reserve_pool_size = 5 ; emergency connections for traffic spikes
server_idle_timeout = 600
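On the application side, Prisma (mentioned above) documents a pgbouncer=true connection-string flag that disables the prepared-statement behavior that breaks in transaction mode. The host, port, and credentials below are placeholders:

# .env — placeholders throughout; 6432 is PgBouncer's conventional port
DATABASE_URL="postgresql://app_user:secret@pgbouncer-host:6432/mydb?pgbouncer=true"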
Caching with Redis
Replicas and pooling make your database faster under load. Caching lets you skip it entirely for data that hasn't changed. The tradeoff is staleness — and getting cache invalidation wrong is how you serve stale prices at checkout, show deleted content, or leak one user's data to another.
Cache-aside pattern
// Cache-aside: the application manages the cache explicitly
async function getUserProfile(userId: string): Promise<User> {
const cacheKey = `user:${userId}`;
// 1. Check cache first
const cached = await redis.get(cacheKey);
if (cached) return JSON.parse(cached);
// 2. Cache miss → hit database
const user = await db.query('SELECT * FROM users WHERE id = $1', [userId]);
// 3. Write to cache with TTL
await redis.setex(cacheKey, 300, JSON.stringify(user)); // 5 min TTL
return user;
}
// On update: invalidate the cache
async function updateUser(userId: string, data: Partial<User>) {
await db.query('UPDATE users SET ... WHERE id = $1', [userId]);
await redis.del(`user:${userId}`); // invalidate
}
What to cache (and what not to)
Good candidates for caching: user profiles, product listings, configuration data, aggregated counts, anything that changes rarely and is read often.
Bad candidates: paginated lists (one insert invalidates every page), anything requiring strong consistency (account balances, inventory counts), data that changes on every request.
// ❌ Don't cache paginated lists — invalidation is a nightmare
const cacheKey = `products:page:${page}:limit:${limit}`;
// Any new product means all pages are stale → cache thrash
// ✅ Cache the individual items, not the list
const cacheKey = `product:${productId}`;
// List queries hit the DB, but individual product fetches are cached
Always set a TTL
A cache key with no TTL lives forever. Your Redis grows unboundedly. Stale data accumulates. Users see a deleted product. A user who changed their name 6 months ago still sees the old one. Always set a TTL, even if it's 24 hours. The right TTL is the longest your business can tolerate stale data for that entity.
Putting it together
PRODUCTION READ SCALING STACK
──────────────────────────────────────────────────────────────
                 ┌──────────────┐
Writes ─────────►│   Primary    │◄── PgBouncer (transaction mode)
                 └──────┬───────┘           ▲
                        │ WAL               │ write path
           ┌────────────┴───────┐           │
           ▼                    ▼           │
     ┌──────────┐         ┌──────────┐      │
     │Replica 1 │         │Replica 2 │      │
     └─────┬────┘         └─────┬────┘      │
           │                    │           │
           └────────┬───────────┘           │
                    ▼                       │
            PgBouncer (reads)               │
                    │                       │
            ┌───────┴────────┐              │
            ▼                ▼              │
     App Server 1     App Server 2 ─────────┘
           │
           ▼
         Redis ←── cache-aside, TTL on everything
Quiz: scaling reads
Q1. Your app creates a user, then immediately redirects to their profile page. The profile shows "User not found." The INSERT succeeded. What's happening and how do you fix it?
Replication lag. You wrote to the primary, then redirected. The profile page read from a replica that hadn't replicated the new row yet — the replica was behind by even a few milliseconds.
Fix options:
- Read immediately-after-write from the primary instead of the replica
- Implement session stickiness: for 2–3 seconds after a write, route that user's reads to the primary
- Use synchronous replication (synchronous_commit = remote_apply, which makes COMMIT wait until the standby has applied the change) for critical write paths — but this adds latency
// Simple fix: read-after-write uses primary connection
const user = await primary.query('SELECT * FROM users WHERE id = $1', [newUserId]);
Q2. You configure PgBouncer in statement mode to maximise connection efficiency. Your app uses SQLAlchemy with session.begin() for transactions. What breaks and why?
Multi-statement transactions break entirely. In statement mode, PgBouncer returns the connection to the pool after every single SQL statement. BEGIN runs on connection A. The next statement runs on connection B. The COMMIT runs on connection C. There is no coherent transaction — the database never sees a complete BEGIN...COMMIT on the same connection.
Statement mode also breaks SET commands (session settings reset), prepared statements, advisory locks, and LISTEN/NOTIFY.
Fix: Use transaction mode. It supports BEGIN...COMMIT correctly because the connection is held for the duration of one transaction, then returned.
Q3. Your cache-aside stores user profiles with no TTL. After 6 months, Redis is 40 GB and growing. Only 200k of 2M users are monthly-active. What's wrong?
No TTL means keys never expire. Every user profile ever cached stays in Redis forever. You're caching 2M users but only 200k are active — 1.8M stale cache entries that nobody reads are consuming most of your 40 GB.
Fix: Always set a TTL. For user profiles, something like 30–60 minutes is usually right. Redis will automatically evict the key when it expires, and the next read for that user repopulates it from the database.
// Always include a TTL
await redis.setex(`user:${userId}`, 3600, JSON.stringify(user)); // 1 hour
You can also configure Redis with maxmemory-policy allkeys-lru as a safety net — Redis evicts the least-recently-used keys when memory is full. But TTLs are the real fix.
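That safety net is two lines of redis.conf — the memory limit here is an example; size it to your workload:

# redis.conf
maxmemory 2gb
maxmemory-policy allkeys-lru   # evict least-recently-used keys at the limit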
Q4. Spot the bug: your app routes reads to replicas when the query starts with SELECT. Under what condition does this fail dangerously?
def get_connection(query):
    if query.strip().upper().startswith("SELECT"):
        return replica_pool.get()
    return primary_pool.get()
It fails for read-your-own-writes after any write. After an INSERT or UPDATE, if the very next query is a SELECT (checking the result of the write), this logic routes it to the replica — which may not have replicated the change yet. The SELECT returns stale data or no data.
It also fails for SELECT ... FOR UPDATE — that's a locking read that must run on the primary, but this code sends it to a replica where it either fails or silently acquires no lock.
Better approach: Track write state per request/transaction and force primary reads for a configurable window after any write in that session.
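One way to sketch that write tracking — the in-memory session map, pool names, and 2-second window here are illustrative assumptions, not a fixed recipe (in a multi-instance deployment the timestamp would live in a shared store or a cookie):

```typescript
// Sticky read-after-write routing: after any write in a session, route that
// session's reads to the primary for a short window, then fall back to replicas.
type Target = 'primary' | 'replica';

const STICKY_WINDOW_MS = 2000; // how long reads stay pinned to the primary
const lastWriteAt = new Map<string, number>(); // sessionId → time of last write

// Call this whenever the session performs an INSERT/UPDATE/DELETE
function recordWrite(sessionId: string): void {
  lastWriteAt.set(sessionId, Date.now());
}

// Call this to decide where a read should go
function routeRead(sessionId: string): Target {
  const t = lastWriteAt.get(sessionId);
  // No recent write in this session → safe to read from a replica
  if (t === undefined || Date.now() - t > STICKY_WINDOW_MS) return 'replica';
  return 'primary';
}
```

The map also needs periodic cleanup of expired entries to avoid unbounded growth — omitted here for brevity.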
Q5. A batch import inserts 10M rows over 30 minutes. Replica lag spikes to 45 seconds. Users see missing data on dashboards reading from replicas. What's the root cause and two ways to mitigate it?
Root cause: Async replication can't keep up with the write throughput of the batch job. The replica is applying WAL records as fast as it can, but the primary is generating them faster — lag accumulates. Any replica reads during this window return data that's up to 45 seconds stale.
Two mitigations:
- Route dashboards to primary during known batch windows. If the import runs nightly 2–4 AM, your routing layer can temporarily direct all reads to the primary during that window. Adds load to primary but eliminates staleness.
- Throttle the batch import to limit WAL generation rate. Add pg_sleep(0.01) between batches, or reduce batch size. Slows the import but keeps replica lag low — users see consistent data throughout.
A third option: expose replica lag as a metric, and have the application fall back to the primary automatically if lag exceeds a threshold (e.g., >5 seconds).
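That fallback can be sketched as follows. The 5-second threshold and the minimal Queryable interface are illustrative assumptions; pg_last_xact_replay_timestamp() is a standard Postgres function that reports the commit time of the last WAL record replayed on a standby:

```typescript
// Minimal shape of what we need from a connection pool (node-postgres
// compatible) — using an interface keeps the sketch dependency-free.
interface Queryable {
  query(sql: string): Promise<{ rows: Array<Record<string, unknown>> }>;
}

const LAG_THRESHOLD_SECONDS = 5; // illustrative; tune to your tolerance

// On a standby, now() - pg_last_xact_replay_timestamp() approximates lag.
async function replicaLagSeconds(replica: Queryable): Promise<number> {
  const res = await replica.query(
    "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag"
  );
  return Number(res.rows[0]?.lag ?? 0);
}

// Route a read to the replica unless it has fallen too far behind.
async function pickReadPool(
  primary: Queryable,
  replica: Queryable
): Promise<Queryable> {
  const lag = await replicaLagSeconds(replica);
  return lag > LAG_THRESHOLD_SECONDS ? primary : replica;
}
```

In production you would cache the lag value for a second or two rather than query it on every read, and treat a failed lag check as "use the primary".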
Part 5 done. Next up — Part 6: Scaling Writes. Partitioning, batch inserts, UPSERT patterns, and how to stop a single Postgres primary from becoming your write bottleneck.