How a Redis Connection Leak Crashed Our AWS ECS Cluster at 3AM
March 12, 2026 · Docker · 8 min read


3:12 AM, Tuesday. PagerDuty fires for our primary AWS ECS cluster. Load balancer health checks failing across all three production tasks. Within four minutes the React SSR service is returning 502s to 100% of traffic, about 2,100 active users mid-session. We scaled from 3 to 6 ECS tasks. Every new task died inside 90 seconds. The outage ran 47 minutes before we understood what was actually happening, and most of those minutes were me being confidently wrong.


The 3AM alert: tasks dying faster than we could replace them

The first CloudWatch alarm showed api-ssr-prod with 0 healthy tasks. The ALB target group listed every target as unhealthy. The health endpoint (/api/health, normally a boring HTTP 200) wasn't responding in time.

Instinct said traffic spike, so we triggered a horizontal scale-out:

scale-out attempt — ecs-scale.sh
aws ecs update-service \
  --cluster prod-cluster \
  --service api-ssr-prod \
  --desired-count 6

Each new task launched, climbed to 2,048 MB of memory, and got OOM-killed before completing a single health check cycle. The scale-out made things actively worse: six dying tasks instead of three.

47 min — total outage duration
2,100 — active users affected
100% — 502 error rate at peak
6 — ECS tasks OOM-killed before root cause found

False assumptions: we blamed everything except the code

Hypothesis one: an AWS infrastructure fault. us-east-1 had a partial EBS degradation two months earlier, so that's where muscle memory took me first. The Service Health Dashboard was green across the board.

Hypothesis two: memory regression from the deploy six hours ago. Rolled back to the previous task definition. Tasks still climbed to 2,048 MB and died.

Hypothesis three: traffic anomaly or DDoS. Request rate at 3 AM was 340 req/s, normal for that hour. CloudFront and Route 53 logs were boring.

Twenty-eight minutes chasing these.

The real problem had been running silently since 9:08 PM, six hours before the alert. The deploy I had reflexively rolled back was, in fact, the deploy that introduced the bug — and the rollback put the same bug back, because the revision I rolled back to still contained it. I hadn't rolled back far enough, and I figured this out much later than I'd like to admit.


CloudWatch container insights: six hours of steady climb

Pulling the MemoryUtilization metric for the prior 12 hours showed a perfectly linear slope: it started at 240 MB after the 9:08 PM deploy, climbed at a constant 5.3 MB/min, and hit the 2,048 MB hard limit at 3:12 AM. No spike. No anomaly. A leak.

ECS Task Memory (MB) — 12-Hour Window
─────────────────────────────────────────────────────────────
2048 |                                        ████ <- OOM KILL
     |                                    ████
1536 |                                ████
     |                            ████
1024 |                        ████
     |                    ████
 512 |                ████
     |            ████
 240 |████████████ <- deploy at 9:08 PM
     └────────────────────────────────────────────────────────
      9PM    10PM    11PM    12AM    1AM    2AM    3AM
                                                   ^-- outage
─────────────────────────────────────────────────────────────
Slope: +5.3 MB/min   Duration: 344 min   Tasks: 3 (all same)

All three tasks showed the exact same curve with no divergence, which rules out a per-task anomaly. Node.js heap metrics from process.memoryUsage() on /metrics stayed flat at around 180 MB. The growing memory was not the V8 heap. It was native OS handles.
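This is the single most useful signal for distinguishing a heap leak from a handle leak. As a hedged sketch (our actual /metrics handler is not shown here), a snapshot that reports both heapUsed and rss makes the gap visible — a flat heap under a climbing rss points at sockets and TLS contexts, not JavaScript objects:

```typescript
// Sketch of a metrics snapshot separating V8 heap from total process memory.
// A flat heapUsedMb with a climbing rssMb means the growth is outside V8:
// native handles such as sockets and TLS contexts.
function memorySnapshot(): { heapUsedMb: number; rssMb: number; nativeGapMb: number } {
  const { heapUsed, rss } = process.memoryUsage();
  const toMb = (bytes: number) => Math.round(bytes / 1024 / 1024);
  return {
    heapUsedMb: toMb(heapUsed),
    rssMb: toMb(rss),
    // rss - heapUsed approximates memory held outside the V8 heap
    nativeGapMb: toMb(rss - heapUsed),
  };
}

console.log(memorySnapshot());
```

Graphing nativeGapMb over time would have shown the 5.3 MB/min climb directly, hours before the heap-only dashboards noticed anything.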

We ran INFO clients against the production Redis instance:

redis-cli — client count check
$ redis-cli -u ${REDIS_URL} INFO clients

# Clients
connected_clients:8847
blocked_clients:0
tracking_clients:0
clients_in_timeout_table:0

A service with 3 ECS tasks should have had 3 Redis connections. It had 8,847.


Root cause: createClient() called on every SSR request

Six hours earlier, a developer had added server-side caching to a product listing page. The Redis client was instantiated inside the async function rather than at module level:

pages/api/products.ts — broken pattern
// BAD: runs on EVERY request — new client, new TCP+TLS socket, never closed
export async function getServerSideProps() {
  const { createClient } = await import('redis');

  const client = createClient({
    url: process.env.REDIS_URL,
    socket: { tls: true },
  });

  await client.connect();
  const cached = await client.get('products:all');

  // client.disconnect() never called — the function returns, but the open
  // socket keeps the client pinned in the event loop, so neither the object
  // nor the OS file descriptor is ever released

  return {
    props: { products: cached ? JSON.parse(cached) : [] },
  };
}

createClient() was being called on every SSR request. With TLS enabled, each call opened a new TCP connection and performed a full TLS handshake. The local client variable went out of scope when the function returned, but the open socket kept the client registered with the Node.js event loop, so it was never garbage-collected and its file descriptor was never closed. Each socket lived until Redis itself decided to close it — and production Redis had timeout 0 (disabled), a deliberate setting to avoid dropping long-running background job connections.

At 340 req/s with around 80 ms SSR latency, roughly 27 requests ran concurrently. Over 344 minutes that's about 7.0 million requests, 8,847 leaked connections, and 5.3 MB/min of accumulated OS socket buffer and SSL context memory. Exactly matching the CloudWatch slope.
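The arithmetic in that paragraph is worth making explicit. A back-of-envelope check, using only the constants from the incident timeline (nothing here queries real infrastructure), reproduces both the concurrency figure and the time-to-OOM:

```typescript
// Back-of-envelope reproduction of the incident numbers.
const reqPerSec = 340;        // request rate at 3 AM
const ssrLatencySec = 0.08;   // ~80 ms SSR latency
const durationMin = 344;      // 9:08 PM deploy to 3:12 AM OOM

// Little's law: concurrent requests = arrival rate x latency
const concurrent = reqPerSec * ssrLatencySec;        // ~27

const totalRequests = reqPerSec * durationMin * 60;  // 7,017,600 — "about 7.0 million"

// Time from the 240 MB baseline to the 2,048 MB hard limit at 5.3 MB/min
const minutesToOom = (2048 - 240) / 5.3;             // ~341 min, matching the 344 min observed

console.log({ concurrent: Math.round(concurrent), totalRequests, minutesToOom: Math.round(minutesToOom) });
```

The slope predicted the kill time to within three minutes, which is what made the "perfectly linear" CloudWatch curve so damning in hindsight.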

BROKEN: New Redis Client Per SSR Request
═══════════════════════════════════════════════════════
 Browser
    |
    v
 ECS Task (Node.js process)
    |
    v
 getServerSideProps()
    |
    +--> createClient()       <-- NEW client every request
    |    client.connect()     <-- NEW TCP+TLS socket opened
    |    client.get(key)
    |    return props
    |    [client goes out of scope]
    |    [OS socket NEVER closed]     <-- LEAK
    v
 Redis: 8,847 open connections
 ECS:   2,048 MB hard limit --> OOM KILL
═══════════════════════════════════════════════════════

FIXED: Module-Level Singleton (1 connection per task)
═══════════════════════════════════════════════════════
 [Module load]
    |
    +--> createClient() ONCE
    |    client.connect() ONCE
    v
 client singleton (shared across all requests)
    |
    |   Browser
    |      |
    |      v
    |   ECS Task --> getServerSideProps()
    |                     |
    +<--------------------+  reuse existing client
    |
    v
 Redis: 3 open connections (1 per task)
 ECS:   ~180 MB stable
═══════════════════════════════════════════════════════

Architecture fix: singleton client with cold-start guard

The fix was a module-level singleton with a concurrent-initialization guard. One client per Node.js process, initialized once, reused across all requests regardless of how many arrive during the initial cold start.

lib/redis.ts — singleton with guard
import { createClient, RedisClientType } from 'redis';

let client: RedisClientType | null = null;
let connectPromise: Promise<void> | null = null;

export async function getRedisClient(): Promise<RedisClientType> {
  if (client?.isReady) return client;

  // If a connect is already in flight, wait for it (thundering herd guard)
  if (connectPromise) {
    await connectPromise;
    return client!;
  }

  client = createClient({
    url: process.env.REDIS_URL,
    socket: {
      tls: process.env.NODE_ENV === 'production',
      reconnectStrategy: (retries) => Math.min(retries * 50, 2000),
    },
  });

  client.on('error', (err) => console.error('[Redis] error:', err));

  connectPromise = client
    .connect()
    .finally(() => { connectPromise = null; });

  await connectPromise;
  return client;
}

// Graceful drain — ECS sends SIGTERM before killing the task
process.on('SIGTERM', async () => {
  if (client?.isReady) await client.disconnect();
});

Why a singleton over a pool? Redis is single-threaded. One connection handles concurrent pipelined commands efficiently, and a pool adds overhead and extra open handles for no throughput benefit in an SSR workload. The connectPromise guard matters: during ECS cold starts, multiple SSR requests can arrive before the first connection finishes. Without the guard, each request races to call createClient(), which is the exact pattern we just fixed. Would be poetic if it weren't so embarrassing.
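The cold-start race is easy to demonstrate without a real Redis. This simulation (a sketch — fakeConnect stands in for client.connect(), with a setTimeout in place of the TCP+TLS handshake) fires ten concurrent callers at a guard structured like the one above and counts how many "connections" get made:

```typescript
// Simulation of the connectPromise guard — no real Redis involved.
// Ten concurrent cold-start callers should produce exactly one connection.
let fakeConnections = 0;
let fake: { ready: boolean } | null = null;
let pending: Promise<void> | null = null;

async function fakeConnect(): Promise<void> {
  await new Promise<void>((resolve) => setTimeout(resolve, 10)); // "handshake"
  fakeConnections += 1;
}

async function getClient(): Promise<{ ready: boolean }> {
  if (fake?.ready) return fake;
  if (pending) {          // a connect is already in flight — wait for it
    await pending;
    return fake!;
  }
  fake = { ready: false };
  pending = fakeConnect()
    .then(() => { fake!.ready = true; })
    .finally(() => { pending = null; });
  await pending;
  return fake;
}

(async () => {
  await Promise.all(Array.from({ length: 10 }, () => getClient()));
  console.log(fakeConnections); // 1 — the guard collapsed 10 callers into one connect
})();
```

Delete the `if (pending)` branch and the same run makes ten connections — exactly the per-request pattern the incident started with, just compressed into one cold start.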

We also tightened the ECS task definition memory envelope:

task-definition.json — memory settings
{
  "memory": 1024,
  "memoryReservation": 512,
  "environment": [
    {
      "name": "NODE_OPTIONS",
      "value": "--max-old-space-size=768"
    }
  ]
}

Setting --max-old-space-size=768 explicitly caps the V8 heap and forces earlier GC cycles. Before this, Node defaulted to roughly 1.4 GB heap on a 2 GB container, leaving almost no headroom for native handles or the Next.js route cache before the ECS hard limit. The new 1,024 MB hard limit sits 256 MB above the explicit V8 ceiling, which is enough room for a CloudWatch alarm to fire before an OOM kill.
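The envelope arithmetic is simple but worth writing down, because the alarm threshold only works if it sits between the V8 ceiling and the hard limit. A quick sanity check (the 80% alarm figure is from our later monitoring setup, described below in the lessons):

```typescript
// Sanity check of the memory envelope.
const hardLimitMb = 1024;   // ECS task hard limit
const v8CeilingMb = 768;    // --max-old-space-size

// Headroom left for native handles, socket buffers, and the Next.js cache
const headroomMb = hardLimitMb - v8CeilingMb;     // 256

// Alarm at 80% of the hard limit — above the V8 ceiling, so only native
// growth can trip it, and well below the OOM kill at 1,024 MB
const alarmThresholdMb = hardLimitMb * 0.8;       // 819.2

console.log({ headroomMb, alarmThresholdMb });
```

Because V8 is capped at 768 MB, total memory crossing 819 MB can only mean native-handle growth — which is precisely the failure mode this incident taught us to watch for.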


Why staging didn't catch this

"The integration test hit the endpoint three times. The load test hit it 200 times over 90 seconds. Production hit it 7 million times over 6 hours."

Staging Redis had timeout 300 (the Redis default). Leaked connections were evicted every 5 minutes, so memory never climbed enough to alarm. Production Redis had timeout 0, a deliberate setting to avoid dropping long-running background job connections. One config delta made the leak completely invisible in staging.

Our CI load test ran for 90 seconds. A 5.3 MB/min leak produces 8 MB over 90 seconds, which is undetectable against normal variance. The same test run for 15 minutes would have shown 80 MB of growth and caught it immediately.
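The growth assertion itself is a few lines. As a hedged sketch (a hypothetical helper, not our actual CI code): sample memory once per minute under constant load, then fail the build if the final five-minute window grew more than the threshold.

```typescript
// Sketch of a soak-test growth assertion: given per-minute memory samples,
// report percentage growth over the final window.
function growthPct(samplesMb: number[], windowMinutes = 5): number {
  const window = samplesMb.slice(-windowMinutes);
  const first = window[0];
  const last = window[window.length - 1];
  return ((last - first) / first) * 100;
}

// A 5.3 MB/min leak on a 240 MB baseline is unmissable over 15 minutes:
const leaky = Array.from({ length: 15 }, (_, min) => 240 + 5.3 * min);
console.log(growthPct(leaky) > 5);  // true — fails a 5% threshold

const stable = Array.from({ length: 15 }, () => 240);
console.log(growthPct(stable));     // 0 — passes
```

The key design choice is asserting on the *final* window, not the whole run: warm-up allocation (JIT, route caches) legitimately grows memory early, while a leak keeps the slope going after warm-up should be done.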


Lessons learned

  • Module-level singletons for all I/O clients. Redis, database connections, HTTP agents. Initialize once at module load, never inside request handlers. Dynamic import() inside async functions is especially easy to get wrong: the module itself is cached after the first load, but any client construction in the handler body still runs on every request, and the inline import hides the dependency from module-level review.
  • Staging Redis config has to mirror production. timeout 0 in prod vs timeout 300 in staging made the leak invisible before deploy. Treat connection timeout config as a correctness concern, not just an ops preference.
  • Add a memory soak test to CI. A 15-minute constant-load test with a memory growth assertion (<5% increase over the final 5 minutes) would have caught this before merge. Added it the following sprint.
  • Monitor Redis connected_clients as a canary. Client count should be flat relative to ECS task count, not proportional to request rate. A rising ratio is a connection leak, catchable hours before memory becomes critical.
  • Set ECS memoryReservation plus a CloudWatch alarm at 80%. Hard memory limits are silent killers. Having a soft reservation plus an alarm at 80% of the hard limit gives you a window to diagnose before the OOM kill fires.
8,847 — leaked Redis connections at peak
5.3 MB/min — memory growth rate before fix
180 MB — stable memory after fix (was 2,048 MB)
3 — Redis connections post-fix (1 per ECS task)

The 47-minute outage bought us a postmortem, a Redis monitoring dashboard, a soak test in CI, and a team convention that no I/O client gets initialized inside a request handler. We added an ESLint rule to flag createClient calls inside async functions. It's caught two similar patterns in the three months since.
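Our exact ESLint rule isn't reproduced here, but a no-restricted-syntax sketch along these lines catches the core pattern. The selector strings are assumptions and would need tuning (arrow-function handlers, aliased imports) before relying on them:

```json
{
  "rules": {
    "no-restricted-syntax": [
      "error",
      {
        "selector": "FunctionDeclaration[async=true] CallExpression[callee.name='createClient']",
        "message": "Instantiate I/O clients at module level, not inside request handlers."
      },
      {
        "selector": "ArrowFunctionExpression[async=true] CallExpression[callee.name='createClient']",
        "message": "Instantiate I/O clients at module level, not inside request handlers."
      }
    ]
  }
}
```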
