ECS Autoscaling Fought Our Postgres max_connections at 2AM and Postgres Won
2:14 AM, Saturday. PagerDuty woke me up. API error rate had gone from 0% to 78% in under
four minutes. Eleven thousand users were mid-checkout during a flash sale we'd announced
via email six hours earlier. The error was one I'd never seen in production:
remaining connection slots are reserved for non-replication superuser connections.
It took us 2.5 hours to fully understand what happened. The root cause was embarrassingly
simple math we'd never done.
Production failure
The timeline was brutal in its speed. 2:10 AM, the flash sale promo email hit 80,000 inboxes. 2:12 AM, traffic had tripled. ECS autoscaling kicked in exactly as designed. 2:14 AM, health checks started failing. 2:18 AM, 78% of API requests were returning 500s.
CloudWatch showed ECS scaling from 5 tasks to 38 tasks in roughly 12 minutes. P99 latency went from 180ms to 30 seconds (our timeout limit) and then to outright connection refusals. Roughly $14,200 in transactions failed or were abandoned. We manually rolled back to 5 tasks and restored service at 4:42 AM, 2 hours and 28 minutes after the first alert.
TIMELINE OF COLLAPSE
02:10 AM  Promo email delivered → traffic 3× normal
02:12 AM  ECS autoscaling triggers [5 tasks → scaling up]
02:14 AM  First health check fails [error rate: 12%]
02:16 AM  New tasks can't connect [error rate: 41%]
02:18 AM  Connection slots exhausted [error rate: 78%]
02:19 AM  Old tasks start failing too [error rate: 94%]
02:22 AM  On-call engineer joins bridge
04:42 AM  Service restored (manual rollback + pool resize)

Users affected: ~11,400
Failed revenue: ~$14,200
Time to resolve: 2h 28m
False assumptions
Our autoscaling configuration looked fine on paper. Sensible CPU and memory thresholds,
tested deployments under load, validated individual task health. What we'd never done
was treat max_connections as a hard ceiling that every new task competed for.
The assumption baked into our infra was simple: more tasks, more traffic handled.
That's true in a stateless world. Every Node.js task we ran used knex with
a connection pool, and every pool was configured with the same environment variable,
PG_POOL_SIZE=10, which we'd set once years ago and never touched. Reasonable
for 5 tasks. Catastrophic for 38.
We also assumed the RDS instance was sized generously. Six months earlier we'd upgraded
it from db.t3.small to db.t3.medium for performance. Nobody
re-checked what that meant for max_connections. On RDS, that value is
formula-driven: LEAST(DBInstanceClassMemory / 9531392, 5000). A
db.t3.medium has 4 GB of RAM, giving it a max_connections of
about 170. I did not know any of this at 2:14 AM.
Investigation
The error message itself was the first real clue.
remaining connection slots are reserved for non-replication superuser connections
is Postgres's way of saying it's completely out of connections. Not slow. Exhausted.
I queried pg_stat_activity from an admin connection (thankfully Postgres reserves
3 slots for superusers by default):
SELECT count(*), state, wait_event_type
FROM pg_stat_activity
WHERE datname = 'proddb'
GROUP BY state, wait_event_type
ORDER BY count DESC;
-- count | state  | wait_event_type
-- ------+--------+-----------------
--   167 | active | Client
-- (1 row)
SELECT setting FROM pg_settings WHERE name = 'max_connections';
-- 170
All 167 available slots were in use. Meanwhile ECS was trying to start new tasks and
each one needed at least 1 connection to pass its health check, which hit a
/healthz endpoint that ran a SELECT 1. They couldn't get a
connection, so they failed health checks, so ECS terminated them and started fresh
ones, which also couldn't connect. The autoscaler was in a death loop, burning
connection attempts without ever succeeding.
When existing tasks then tried to acquire new pool connections as idle ones aged out, they started failing too. Within minutes even the healthy tasks were rejecting requests.
THE CONNECTION MATH (why it broke)

ECS tasks:           38
Pool size/task:    × 10
───────────────────────
Total attempted:    380 connections

RDS db.t3.medium max_connections:  170
Reserved for superuser:           -  3
──────────────────────────────────────
Available to app:                  167

Overflow: 380 - 167 = 213 connections REFUSED

New tasks → health check fails → ECS cycles → repeat
Old tasks → pool refresh fails → requests error → cascade
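The arithmetic in that box fits in a few lines of Node.js (all numbers are from this incident; the 3 reserved slots are Postgres's default superuser_reserved_connections):

```javascript
// Connection math from the incident: 38 tasks × 10 pool slots vs. a 170-connection ceiling.
const tasks = 38;
const poolSizePerTask = 10;
const maxConnections = 170;      // RDS db.t3.medium
const superuserReserved = 3;     // Postgres default superuser_reserved_connections

const attempted = tasks * poolSizePerTask;                  // 380
const availableToApp = maxConnections - superuserReserved;  // 167
const refused = attempted - availableToApp;                 // 213 connections refused

console.log(`attempted=${attempted} available=${availableToApp} refused=${refused}`);
```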
Root cause
The root cause was a missing invariant. We had never encoded the constraint
MAX_TASKS × POOL_SIZE < max_connections anywhere. Not in code, docs, IaC,
or alerts. Not in runbooks. Not in the autoscaling config. It simply didn't exist as a
concept in our system design.
Compounding this: we'd upgraded RDS for performance, not knowing max_connections
is RAM-proportional. Going from db.t3.small (2 GB) to db.t3.medium
(4 GB) doubled max_connections from ~85 to ~170. That felt like plenty. We
never wrote the number down or checked it against the autoscaling ceiling.
The cascade was worse because ECS's default health check grace period was 30 seconds. During those 30 seconds, a new task held pool slots even while failing health checks. Each doomed task burned connections for 30 seconds before ECS killed it and tried again. There were always "zombie" connection attempts draining the pool.
The fix
The emergency fix was manual. Reduce PG_POOL_SIZE to 2, set ECS desired
count back to 5, let the task churn settle. Service was restored 8 minutes after those
changes.
The permanent fix had three parts. First, we added PgBouncer as a connection pooler in transaction pooling mode between ECS tasks and RDS.
[databases]
proddb = host=rds-endpoint.us-east-1.rds.amazonaws.com port=5432 dbname=proddb
[pgbouncer]
listen_port = 5432
listen_addr = 0.0.0.0
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; Transaction pooling: connection released after each transaction
pool_mode = transaction
; Max server connections PgBouncer holds to Postgres (per database/user pair)
default_pool_size = 80
max_db_connections = 100
; Client-facing: up to 1000 app connections multiplexed
max_client_conn = 1000
; Keep some connections warm
min_pool_size = 5
reserve_pool_size = 10
With PgBouncer, our ECS tasks connect to the pooler instead of directly to RDS. PgBouncer maintains at most 100 real Postgres connections regardless of how many ECS tasks exist. In transaction mode, a server connection is only held for the duration of a transaction. Idle app connections consume zero server connections.
Second, we added a startup assertion in our Node.js app that fails fast if the math is wrong:
const knex = require('knex');

const POOL_SIZE = parseInt(process.env.PG_POOL_SIZE ?? '5', 10);
const MAX_ECS_TASKS = parseInt(process.env.ECS_MAX_TASKS ?? '50', 10);
const PG_MAX_CONNECTIONS = parseInt(process.env.PG_MAX_CONNECTIONS ?? '170', 10);
// Fail loudly at startup rather than silently at 2AM
const worstCaseConnections = MAX_ECS_TASKS * POOL_SIZE;
if (worstCaseConnections >= PG_MAX_CONNECTIONS * 0.8) {
throw new Error(
`Connection math unsafe: ${MAX_ECS_TASKS} tasks × ${POOL_SIZE} pool = ` +
`${worstCaseConnections} connections ≥ 80% of max (${PG_MAX_CONNECTIONS}). ` +
`Reduce PG_POOL_SIZE or add PgBouncer.`
);
}
const pool = knex({
client: 'pg',
connection: { host: process.env.PGBOUNCER_HOST, /* ... */ },
pool: { min: 1, max: POOL_SIZE },
});
Third, we added a CloudWatch alarm on the RDS DatabaseConnections metric,
firing at 70% of max_connections, so we'd know long before saturation.
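As a sketch, the alarm definition looks roughly like this. The parameter names follow CloudWatch's PutMetricAlarm API; the DB instance identifier is a placeholder, not our real one:

```javascript
// Alarm sketch: fire when RDS DatabaseConnections crosses 70% of max_connections.
// Shape follows the CloudWatch PutMetricAlarm API; 'proddb-instance' is a placeholder.
const MAX_CONNECTIONS = 170;

const alarmParams = {
  AlarmName: 'rds-connection-saturation',
  Namespace: 'AWS/RDS',
  MetricName: 'DatabaseConnections',
  Dimensions: [{ Name: 'DBInstanceIdentifier', Value: 'proddb-instance' }],
  Statistic: 'Maximum',
  Period: 60,                                    // seconds per datapoint
  EvaluationPeriods: 3,                          // sustained, not a blip
  Threshold: Math.round(MAX_CONNECTIONS * 0.7),  // 119 connections
  ComparisonOperator: 'GreaterThanThreshold',
};

// In the real setup this object is passed to
// CloudWatchClient.send(new PutMetricAlarmCommand(alarmParams)).
```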
ARCHITECTURE: BEFORE vs AFTER
BEFORE
──────────────────────────────────────
[ECS Task 1]─┐
[ECS Task 2]─┤ (each: 10 direct connections)
[ECS Task 3]─┼─────────────────▶ [RDS Postgres]
... │ max_conn: 170
[ECS Task 38]┘
Total attempted: 380 → EXHAUSTED
AFTER
──────────────────────────────────────
[ECS Task 1] ─┐
[ECS Task 2] ─┤ (each: 5 connections
[ECS Task 3] ─┤ to PgBouncer)
... ├──▶ [PgBouncer]──────▶ [RDS Postgres]
[ECS Task N] ─┘ max_client: 1000 server_pool: 100
tx pooling mode max_conn: 170
Postgres sees ≤ 100 connections regardless of ECS scale
Lessons learned
Every shared resource needs a capacity formula. Autoscaling changes the multiplier on
your resource consumption. Database connections, Redis connection limits, third-party
API rate limits, anything shared across tasks needs a hard invariant:
MAX_TASKS × PER_TASK_USAGE < RESOURCE_LIMIT. Write it down. Encode it as
a startup assertion. Alert on it at 70%.
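A generic version of that invariant, usable for any shared resource, is small enough to drop into every service's startup path. This is a hypothetical helper, not code from our repo; the names are illustrative:

```javascript
// Generic capacity invariant: fail startup when worst-case autoscaled usage of a
// shared resource approaches its hard limit. Names here are illustrative.
function assertCapacity({ resource, maxTasks, perTaskUsage, limit, headroom = 0.8 }) {
  const worstCase = maxTasks * perTaskUsage;
  const ceiling = limit * headroom;
  if (worstCase >= ceiling) {
    throw new Error(
      `${resource}: ${maxTasks} tasks × ${perTaskUsage} = ${worstCase}, ` +
      `which is ≥ ${headroom * 100}% of limit ${limit}`
    );
  }
  return { worstCase, ceiling };
}

// The incident's numbers fail the check; the post-PgBouncer numbers pass.
// assertCapacity({ resource: 'pg connections', maxTasks: 38, perTaskUsage: 10, limit: 170 }); // throws
assertCapacity({ resource: 'pg connections', maxTasks: 50, perTaskUsage: 5, limit: 1000 });
```

The same call covers Redis connection limits or third-party API rate limits: only the resource name and the limit change.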
RDS instance resizes silently change max_connections. When you change instance class for performance, max_connections moves with RAM because it's formula-driven, not a fixed setting you control. After every instance class change, run
SELECT setting FROM pg_settings WHERE name = 'max_connections'; and update
your capacity math.
PgBouncer is not optional for ECS or Kubernetes workloads. When task count is dynamic, direct Postgres connections are a liability. Transaction pooling mode decouples app-layer concurrency from Postgres server connections. It's a 30-minute setup that prevents exactly this class of incident.
Health check design matters during connection exhaustion. Our /healthz
endpoint ran a SELECT 1, which needs a DB connection. During exhaustion,
every new task burned connections just to confirm it was unhealthy. We split the health
check: a shallow /ping (no DB) for ECS health, and a deeper
/ready (with DB) used only during deployments.
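A minimal sketch of that split, as a plain routing function rather than our actual server code — the dbOk flag stands in for the result of a pooled SELECT 1:

```javascript
// Shallow vs. deep health checks. `dbOk` stands in for the result of a pooled
// `SELECT 1`; the shallow /ping path never reads it, so ECS health checks
// consume zero DB connections during exhaustion.
function healthStatus(path, dbOk) {
  if (path === '/ping') return 200;                // shallow: process alive, no DB
  if (path === '/ready') return dbOk ? 200 : 503;  // deep: used only during deploys
  return 404;
}
```

Pointing the ECS health check at /ping means a connection-starved database can no longer cause task churn, while /ready stays strict so deployments still gate on database reachability.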
The most dangerous outages are math problems, not code bugs. There was no bug in our application code. There was no misconfiguration in Postgres. The system worked exactly as designed. We just hadn't done the arithmetic before enabling autoscaling. I added a capacity planning section to every service's runbook after this. Shared resources, per-task usage, ceiling.