How a Redis Connection Leak Crashed Our AWS ECS Cluster at 3AM
3:12 AM, Tuesday. PagerDuty fires for our primary AWS ECS cluster. Load balancer health checks failing across all three production tasks. Within four minutes the React SSR service is returning 502s to 100% of traffic, about 2,100 active users mid-session. We scaled from 3 to 6 ECS tasks. Every new task died inside 90 seconds. The outage ran 47 minutes before we understood what was actually happening, and most of those minutes were me being confidently wrong.
The 3AM alert: tasks dying faster than we could replace them
The first CloudWatch alarm showed api-ssr-prod with 0 healthy tasks. The ALB target group listed every target as unhealthy. The health endpoint (/api/health, normally a boring HTTP 200) wasn't responding in time.
Instinct said traffic spike, so we triggered a horizontal scale-out:
aws ecs update-service --cluster prod-cluster --service api-ssr-prod --desired-count 6
Each new task launched, climbed to 2,048 MB of memory, and got OOM-killed before completing a single health check cycle. The scale-out made things actively worse: six dying tasks instead of three.
False assumptions: we blamed everything except the code
Hypothesis one: AWS infrastructure fault. us-east-1 had a partial EBS degradation two months earlier, so that's where muscle memory took me first. The AWS Service Health Dashboard was green across the board.
Hypothesis two: memory regression from the deploy six hours ago. Rolled back to the previous task definition. Tasks still climbed to 2,048 MB and died.
Hypothesis three: traffic anomaly or DDoS. Request rate at 3 AM was 340 req/s, normal for that hour. CloudFront and Route 53 logs were boring.
Twenty-eight minutes chasing these.
The real problem had been running silently since 9:08 PM, six hours before the alert. The deploy I had reflexively rolled back was, in fact, the deploy that introduced the bug, but the rollback didn't help: the task definition revision I reverted to still shipped the same buggy code, so the leak came right back. I hadn't rolled back far enough, and I figured that out much later than I'd like to admit.
CloudWatch Container Insights: six hours of steady climb
Pulling the MemoryUtilization metric for the prior 12 hours showed a perfectly linear slope. Starting at 240 MB after the 9:08 PM deploy, climbing at a constant 5.3 MB/min, hitting the 2,048 MB hard limit at 3:12 AM. No spike. No anomaly. A leak.
ECS Task Memory (MB) — 12-Hour Window
─────────────────────────────────────────────────────────────
2048 | ████ <- OOM KILL
| ████
1536 | ████
| ████
1024 | ████
| ████
512 | ████
| ████
240 |████████████ <- deploy at 9:08 PM
└────────────────────────────────────────────────────────
9PM 10PM 11PM 12AM 1AM 2AM 3AM
^-- outage
─────────────────────────────────────────────────────────────
Slope: +5.3 MB/min Duration: 344 min Tasks: 3 (all same)
Three tasks with the exact same curve, no divergence, which rules out a per-task anomaly. Node.js heap metrics from process.memoryUsage() on /metrics stayed flat at around 180 MB. The growing memory was not the V8 heap. It was native OS handles.
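The heap-versus-native split is visible from inside the process itself. A minimal sketch of the check we ran on /metrics (output values are machine-dependent; in our case rss kept climbing while heapUsed stayed flat):

```typescript
// process.memoryUsage() separates V8-managed memory from total resident memory.
const { rss, heapTotal, heapUsed, external } = process.memoryUsage();

const mb = (n: number) => Math.round(n / 1024 / 1024);

console.log(`rss=${mb(rss)}MB heapUsed=${mb(heapUsed)}MB external=${mb(external)}MB`);
// Resident memory not accounted for by the V8 heap is native:
// socket handles, TLS contexts, and kernel-side buffers.
console.log(`non-heap resident: ~${mb(rss - heapTotal)}MB`);
```

When `rss - heapTotal` grows while `heapUsed` is flat, the leak is in native handles and no amount of V8 heap tuning will fix it.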
We ran INFO clients against the production Redis instance:
$ redis-cli -u ${REDIS_URL} INFO clients
# Clients
connected_clients:8847
blocked_clients:0
tracking_clients:0
clients_in_timeout_table:0
A service with 3 ECS tasks should have had 3 Redis connections. It had 8,847.
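That check is easy to automate. A hypothetical helper (names are mine) that parses the plain-text `INFO clients` output and compares it against the expected task count:

```typescript
// Extract connected_clients from the text that `redis-cli INFO clients` returns.
function parseConnectedClients(info: string): number {
  const match = info.match(/^connected_clients:(\d+)/m);
  if (!match) throw new Error('connected_clients not found in INFO output');
  return Number(match[1]);
}

// Sample reuses the production output shown above (Redis uses CRLF line endings).
const sample = ['# Clients', 'connected_clients:8847', 'blocked_clients:0'].join('\r\n');

const clients = parseConnectedClients(sample);
const expectedTasks = 3; // one singleton connection per ECS task
console.log(`connected_clients=${clients}, expected≈${expectedTasks}`);
// 8847 / 3 ≈ 2,949 connections per task: alarm-worthy hours before OOM
```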
Root cause: createClient() called on every SSR request
Six hours earlier, a developer had added server-side caching to a product listing page. The Redis client was instantiated inside the async function rather than at module level:
// BAD: runs on EVERY request — new client, new TCP+TLS socket, never closed
export async function getServerSideProps() {
  const { createClient } = await import('redis');
  const client = createClient({
    url: process.env.REDIS_URL,
    socket: { tls: true },
  });
  await client.connect();
  const cached = await client.get('products:all');
  // client.disconnect() never called — function returns, local var gc'd,
  // but the OS socket handle is NEVER released
  return {
    props: { products: cached ? JSON.parse(cached) : [] },
  };
}
createClient() was being called on every SSR request. With TLS enabled, each call opened a new TCP connection and performed a full TLS handshake. The local client variable went out of scope when the function returned, but an open socket keeps its handle registered with the event loop, so the client is never collected and the OS file descriptor is never released. Each socket lived until Redis itself decided to close it, and production Redis had timeout 0 (idle timeout disabled) to avoid dropping long-running background job connections.
At 340 req/s with around 80 ms SSR latency, roughly 27 requests ran concurrently. Over 344 minutes that's about 7.0 million requests. Redis reported 8,847 open connections, a count creeping toward its default maxclients limit of 10,000, and the socket buffers and TLS contexts accumulating on the client side produced the 5.3 MB/min slope CloudWatch had been drawing all night.
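The back-of-envelope math checks out; every input below comes from the incident numbers above:

```typescript
// Inputs from the incident timeline and CloudWatch metrics.
const reqPerSec = 340;
const ssrLatencySec = 0.08; // ~80 ms per SSR render
const durationMin = 344;    // 9:08 PM deploy to 3:12 AM OOM

const concurrent = reqPerSec * ssrLatencySec;        // in-flight requests
const totalRequests = reqPerSec * durationMin * 60;  // requests over the window
const leakRateMbPerMin = (2048 - 240) / durationMin; // memory climb per minute

console.log({ concurrent, totalRequests, leakRateMbPerMin });
// → ~27 concurrent, 7,017,600 requests, ~5.3 MB/min
```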
BROKEN: New Redis Client Per SSR Request
═══════════════════════════════════════════════════════
Browser
|
v
ECS Task (Node.js process)
|
v
getServerSideProps()
|
+--> createClient() <-- NEW client every request
| client.connect() <-- NEW TCP+TLS socket opened
| client.get(key)
| return props
| [client goes out of scope]
| [OS socket NEVER closed] <-- LEAK
v
Redis: 8,847 open connections
ECS: 2,048 MB hard limit --> OOM KILL
═══════════════════════════════════════════════════════
FIXED: Module-Level Singleton (1 connection per task)
═══════════════════════════════════════════════════════
[Module load]
|
+--> createClient() ONCE
| client.connect() ONCE
v
client singleton (shared across all requests)
|
| Browser
| |
| v
| ECS Task --> getServerSideProps()
| |
+<--------------------+ reuse existing client
|
v
Redis: 3 open connections (1 per task)
ECS: ~180 MB stable
═══════════════════════════════════════════════════════
Architecture fix: singleton client with cold-start guard
The fix was a module-level singleton with a concurrent-initialization guard. One client per Node.js process, initialized once, reused across all requests regardless of how many arrive during the initial cold start.
import { createClient, RedisClientType } from 'redis';

let client: RedisClientType | null = null;
let connectPromise: Promise<void> | null = null;

export async function getRedisClient(): Promise<RedisClientType> {
  if (client?.isReady) return client;

  // If a connect is already in flight, wait for it (thundering herd guard)
  if (connectPromise) {
    await connectPromise;
    return client!;
  }

  client = createClient({
    url: process.env.REDIS_URL,
    socket: {
      tls: process.env.NODE_ENV === 'production',
      reconnectStrategy: (retries) => Math.min(retries * 50, 2000),
    },
  });
  client.on('error', (err) => console.error('[Redis] error:', err));

  connectPromise = client
    .connect()
    .finally(() => { connectPromise = null; });
  await connectPromise;
  return client;
}

// Graceful drain — ECS sends SIGTERM before killing the task.
// quit() waits for pending replies; disconnect() would drop them mid-flight.
process.on('SIGTERM', async () => {
  if (client?.isReady) await client.quit();
});
Why a singleton over a pool? Redis is single-threaded. One connection handles concurrent pipelined commands efficiently, and a pool adds overhead and extra open handles for no throughput benefit in an SSR workload. The connectPromise guard matters: during ECS cold starts, multiple SSR requests can arrive before the first connection finishes. Without the guard, each request races to call createClient(), which is the exact pattern we just fixed. Would be poetic if it weren't so embarrassing.
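The guard's behavior can be demonstrated without a real Redis server. This simulation stubs out the client (the stub names are mine, not the redis package's API) and fires ten concurrent cold-start requests:

```typescript
// Simulation of the connectPromise guard: ten "SSR requests" arrive
// during cold start, but only one client is ever created.
let createCalls = 0;

type StubClient = { isReady: boolean; connect(): Promise<void> };

function createStubClient(): StubClient {
  createCalls++;
  return {
    isReady: false,
    async connect() {
      // Simulate the TCP+TLS handshake latency.
      await new Promise<void>((resolve) => setTimeout(resolve, 10));
      this.isReady = true;
    },
  };
}

let client: StubClient | null = null;
let connectPromise: Promise<void> | null = null;

async function getClient(): Promise<StubClient> {
  if (client?.isReady) return client;
  if (connectPromise) {
    await connectPromise;
    return client!;
  }
  client = createStubClient();
  connectPromise = client.connect().finally(() => { connectPromise = null; });
  await connectPromise;
  return client;
}

async function main() {
  // All ten calls start before the first connect resolves.
  await Promise.all(Array.from({ length: 10 }, () => getClient()));
  console.log(`createStubClient calls: ${createCalls}`); // prints "createStubClient calls: 1"
}
main();
```

The first call assigns `connectPromise` synchronously, before its first `await`, so every later call sees the in-flight promise and waits instead of racing to `createClient()`.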
We also tightened the ECS task definition memory envelope:
{
  "memory": 1024,
  "memoryReservation": 512,
  "environment": [
    {
      "name": "NODE_OPTIONS",
      "value": "--max-old-space-size=768"
    }
  ]
}
Setting --max-old-space-size=768 explicitly caps the V8 heap and forces earlier GC cycles. Before this, Node defaulted to roughly 1.4 GB heap on a 2 GB container, leaving almost no headroom for native handles or the Next.js route cache before the ECS hard limit. The new 1,024 MB hard limit sits 256 MB above the explicit V8 ceiling, which is enough room for a CloudWatch alarm to fire before an OOM kill.
Why staging didn't catch this
Staging Redis had timeout 300, so leaked connections were evicted after five idle minutes and memory never climbed enough to alarm. Production Redis had timeout 0, a deliberate setting to avoid dropping long-running background job connections. One config delta made the leak completely invisible in staging.
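For reference, the delta as it would appear in each environment's redis.conf (a sketch of the two settings, not a full config):

```
# staging: idle clients dropped after 300 s, silently cleaning up the leak
timeout 300

# production: idle timeout disabled so long-lived job workers stay connected
timeout 0
```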
Our CI load test ran for 90 seconds. A 5.3 MB/min leak produces 8 MB over 90 seconds, which is undetectable against normal variance. The same test run for 15 minutes would have shown 80 MB of growth and caught it immediately.
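The assertion behind such a soak test is simple. A sketch (the function name and thresholds are illustrative):

```typescript
// Given per-minute RSS samples (MB) under constant load, require that the
// final window shows bounded growth. A real leak fails this immediately.
function memoryGrowthRatio(samplesMb: number[], windowMin = 5): number {
  const tail = samplesMb.slice(-windowMin);
  return (tail[tail.length - 1] - tail[0]) / tail[0];
}

// A 5.3 MB/min leak on a ~240 MB baseline over 15 minutes:
const leaking = Array.from({ length: 15 }, (_, i) => 240 + 5.3 * i);
// A healthy process with only sampling jitter:
const healthy = Array.from({ length: 15 }, (_, i) => 240 + (i % 2));

console.log(memoryGrowthRatio(leaking) > 0.05 ? 'FAIL' : 'PASS'); // → FAIL
console.log(memoryGrowthRatio(healthy) > 0.05 ? 'FAIL' : 'PASS'); // → PASS
```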
Lessons learned
- Module-level singletons for all I/O clients. Redis, database connections, HTTP agents: initialize once at module load, never inside request handlers. Dynamic `import()` inside an async function is especially deceptive, because the module itself is cached and evaluated once, but anything you instantiate after the import still runs on every call.
- Staging Redis config has to mirror production. `timeout 0` in prod vs `timeout 300` in staging made the leak invisible before deploy. Treat connection timeout config as a correctness concern, not just an ops preference.
- Add a memory soak test to CI. A 15-minute constant-load test with a memory growth assertion (under 5% increase over the final 5 minutes) would have caught this before merge. We added one the following sprint.
- Monitor Redis `connected_clients` as a canary. Client count should be flat relative to ECS task count, not proportional to request rate. A rising ratio is a connection leak, catchable hours before memory becomes critical.
- Set ECS `memoryReservation` plus a CloudWatch alarm at 80% of the hard limit. Hard memory limits are silent killers; a soft reservation and an alarm at 80% give you a window to diagnose before the OOM kill fires.
The 47-minute outage bought us a postmortem, a Redis monitoring dashboard, a soak test in CI, and a team convention that no I/O client gets initialized inside a request handler. We added an ESLint rule to flag createClient calls inside async functions. It's caught two similar patterns in the three months since.
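A rough approximation of that lint rule can be written with ESLint's built-in no-restricted-syntax (the selector below is a sketch: it flags only named async function declarations, not arrow functions or methods):

```
{
  "rules": {
    "no-restricted-syntax": [
      "error",
      {
        "selector": "FunctionDeclaration[async=true] CallExpression[callee.name='createClient']",
        "message": "Do not create I/O clients inside request handlers; use the module-level singleton."
      }
    ]
  }
}
```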