July 3, 2026 · Architecture · 10 min read

From Vercel to AWS, part 2: the breaking points


$120k ARR. 1,200 users. The CTO got a Slack alert at 3 AM: Supabase dashboard showing 497 active connections. The Pro tier limit is 500. Three connections away from every new request failing. The fix took forty-five minutes — update the PgBouncer pool size, restart the workers. The warning had been sitting in the Supabase dashboard for six weeks.

The modern dev platform stack does not break all at once. It sends signals — specific, readable, predictable — weeks before the actual outage. Most founders miss them because they are not monitoring for them. This is the list.

This is part 2 of the series on the infrastructure journey from the $50/month stack to AWS. Part 1 covered what to build on. Part 2 covers when each service starts showing its limits, what the specific thresholds are, and how to tell the difference between a hard limit and an easy configuration fix.

Supabase: the three limits that matter

Most Supabase breaking points are not Supabase's fault. They are the result of how applications connect to Postgres by default — without connection pooling — combined with Postgres's inherent connection overhead. Supabase provides PgBouncer (a connection pooler) on every project, but it is opt-in.

Breaking point 1: connection limit (most common). Postgres creates an OS process per connection. At 500 connections on Supabase Pro, each holding idle memory, your database compute is spending a non-trivial fraction of its RAM managing connections that are waiting rather than working. The visible symptom is not slowness — it is hard rejections: "FATAL: remaining connection slots are reserved for non-replication superuser connections."

-- How to see connection utilization in Supabase SQL editor
SELECT
  count(*) AS total,
  count(*) FILTER (WHERE state = 'active') AS active,
  count(*) FILTER (WHERE state = 'idle') AS idle,
  count(*) FILTER (WHERE state = 'idle in transaction') AS idle_in_tx,
  (SELECT setting::int FROM pg_settings WHERE name = 'max_connections') AS max
FROM pg_stat_activity
WHERE datname = current_database();

-- Warning sign: idle_in_tx > 20 is a connection leak
-- Warning sign: idle > 80% of max means your app is not releasing connections
-- Immediate fix: enable PgBouncer transaction mode for your connection string
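
The warning signs in those comments are easy to turn into an automated check. The following is a minimal sketch, assuming the query's single result row has been mapped into a plain object. The `ConnectionStats` shape, the `connectionWarnings` name, and the 90% utilization alert are illustrative assumptions; only the 20 and 80% thresholds come from the comments above.

```typescript
// Shape of the single row returned by the pg_stat_activity query
// (field names are hypothetical; map them from your SQL client's result).
interface ConnectionStats {
  total: number;
  active: number;
  idle: number;
  idleInTx: number;
  max: number;
}

function connectionWarnings(s: ConnectionStats): string[] {
  const warnings: string[] = [];
  // From the warning signs above: more than 20 idle-in-transaction sessions.
  if (s.idleInTx > 20) {
    warnings.push(`possible connection leak: ${s.idleInTx} idle in transaction`);
  }
  // From the warning signs above: idle connections above 80% of max.
  if (s.idle > 0.8 * s.max) {
    warnings.push(`app is not releasing connections: ${s.idle} idle of ${s.max} max`);
  }
  // Assumed extra threshold (not from the article): alert at 90% utilization.
  if (s.total >= 0.9 * s.max) {
    warnings.push(`near the limit: ${s.total} of ${s.max} connections in use`);
  }
  return warnings;
}

// The 3 AM scenario: 497 of 500 slots taken. The active/idle split is made up.
const exampleWarnings = connectionWarnings({
  total: 497, active: 40, idle: 430, idleInTx: 27, max: 500,
});
console.log(exampleWarnings);
```

Running a check like this on a schedule and posting the result to Slack is one way to surface the signal weeks before the limit is reached, rather than at 497 connections.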

The immediate fix is switching to the pooled connection string (port 6543 on Supabase, not 5432). PgBouncer's transaction mode multiplexes your application connections onto a smaller pool of real Postgres connections. 200 application connections → 20 actual Postgres connections. This resolves most connection limit issues without any migration.

# Before: direct Postgres connection (port 5432 on the database host)
DATABASE_URL=postgresql://postgres:password@db.abcde.supabase.co:5432/postgres

# After: PgBouncer transaction pooler (6543, not 5432)
DATABASE_URL=postgresql://postgres.abcde:password@aws-0-us-east-1.pooler.supabase.com:6543/postgres?pgbouncer=true

# This single change handles most connection pressure up to ~2,000 concurrent users
# ?pgbouncer=true is a Prisma flag; other clients can leave it off
# Prisma note: add ?pgbouncer=true&connection_limit=1 per worker process

Breaking point 2: egress cost at $0.09/GB after 5GB. At low volume, egress is free. At meaningful query volume (pagination with large text fields, file downloads from Supabase Storage, bulk data exports), egress accumulates. $0.09/GB sounds small. At 200GB/month (a reasonable data-heavy application at $100k ARR), that is roughly $18/month extra. At 1TB/month, it is roughly $90. This is rarely the migration trigger, but it is a signal worth watching. The fix is usually application-level: paginate query results, cache read-heavy queries, serve large files from a CDN rather than through the API.
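
To make the egress arithmetic concrete, here is a small sketch using the figures quoted above ($0.09/GB beyond a 5GB allowance). The function name is hypothetical, and it assumes the free allowance is subtracted before billing; check your own invoice for the actual rules on your plan.

```typescript
// Monthly egress overage in USD, rounded to cents.
// Defaults are the figures quoted in the text, not a pricing API.
function egressOverageUsd(
  egressGb: number,
  includedGb: number = 5,
  ratePerGb: number = 0.09,
): number {
  const billableGb = Math.max(0, egressGb - includedGb);
  return Math.round(billableGb * ratePerGb * 100) / 100;
}

console.log(egressOverageUsd(3));    // inside the allowance
console.log(egressOverageUsd(200));  // the 200GB/month case
console.log(egressOverageUsd(1000)); // the 1TB/month case
```

Under these assumptions the 200GB case comes to $17.55 and the 1TB case to $89.55, which matches the rounded figures in the text.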

Breaking point 3: Postgres performance at scale. Supabase gives you a managed Postgres, but it does not tune it for you. Missing indexes and inefficient queries turn into full table scans, and a scan that takes 3ms at 10k rows takes 300ms at 1M rows. The first sign is usually a slow Vercel API route that gets progressively worse week over week. The Supabase dashboard has a query performance panel; check it monthly. If your top slow queries filter or join on unindexed columns, add the index before considering any infrastructure change.

-- Supabase query performance panel equivalent — find slow queries
SELECT
  query,
  calls,
  total_exec_time / calls AS avg_ms,
  rows / calls AS avg_rows
FROM pg_stat_statements
ORDER BY avg_ms DESC
LIMIT 20;

-- Most common fix: missing index on foreign key or WHERE clause column
CREATE INDEX CONCURRENTLY idx_reports_user_id ON reports(user_id);
CREATE INDEX CONCURRENTLY idx_reports_created_at ON reports(created_at DESC);
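-- CONCURRENTLY builds the index without blocking writes to the table
-- (a plain CREATE INDEX blocks writes until it finishes); it cannot run inside a transaction block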

Vercel: two limits that hit without warning

Breaking point 1: cold start latency in your P99. Vercel serverless functions spin up on demand. When a function has not been called recently, the next call incurs a cold start: downloading your bundle, initializing the runtime, running module-level code. For a small Next.js API route with minimal imports, this is 300-600ms. For a route that imports Supabase, Stripe, OpenAI, and Zod, it can be 1.5-3 seconds.

// What cold start impact looks like in metrics
// Warm requests only:          p50=45ms, p95=120ms,   p99=180ms
// With cold starts in the mix: p50=45ms, p95=2,100ms, p99=3,200ms

// The fix is not migrating — it is reducing bundle size and imports
// Bad: importing the entire openai package at module level for one route
import OpenAI from 'openai';

// Better: lazy import inside the handler
export async function POST(req: Request) {
  const { default: OpenAI } = await import('openai');
  const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
  // ...
}

// Also: Next.js route segments config
export const dynamic = 'force-dynamic';
export const runtime = 'nodejs'; // vs 'edge' — pick the right runtime

// Check bundle size: wrap next.config.js with @next/bundle-analyzer
// (enabled when ANALYZE=true), then run: ANALYZE=true npm run build

Cold starts are not a reason to leave Vercel. They are a reason to audit your bundle. In most cases, reducing imports from 15 packages to 3 per route drops cold starts from 2s to 400ms. The genuine migration signal is when your cold start P99 stays above 1.5s after optimization and your P99 SLA requires better than that.
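
Checking that migration signal requires a P99 computed from real latency samples. Below is a minimal sketch using the nearest-rank percentile method; the function names are hypothetical, the 1.5s threshold is the one stated above, and the sample data is invented for illustration.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

// Signal from the text: P99 still above 1.5s after the bundle audit,
// and the SLA asks for something better than the measured P99.
function coldStartIsMigrationSignal(p99Ms: number, slaP99Ms: number): boolean {
  return p99Ms > 1500 && p99Ms > slaP99Ms;
}

// Invented sample: 97 warm requests at 45ms and 3 cold starts at 3,200ms.
const exampleSamples: number[] = [
  ...Array<number>(97).fill(45),
  ...Array<number>(3).fill(3200),
];
const exampleP99 = percentile(exampleSamples, 99);
console.log(exampleP99, coldStartIsMigrationSignal(exampleP99, 1000));
```

Note how only three cold starts in a hundred requests are enough to set the P99 while leaving the median untouched, which is why cold starts show up in tail metrics first.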

Breaking point 2: the 60-second function limit for synchronous workflows. Vercel functions time out at 60 seconds on Pro (300 seconds on Enterprise). PDF generation, report compilation, large file processing, and LLM chains regularly exceed this. The symptom is HTTP 504s on specific endpoints that work fine in development.

This is not a Vercel limit to work around — it is a design signal. The operation belongs in a background worker (Railway or Fly.io), not in the request handler. The route returns 202 with a job ID. The client polls. See Part 1 of the agents-in-production series for the exact queue pattern.
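
The shape of that 202-and-poll flow can be sketched without any framework. This is an illustration only, not the queue pattern from the other series: the in-memory `Map` stands in for a real durable queue, and every name here (`submitJob`, `completeJob`, `pollJob`) is hypothetical.

```typescript
type JobStatus = "queued" | "done";

interface Job {
  id: string;
  status: JobStatus;
  payload: unknown;
  result?: unknown;
}

interface HttpResult {
  status: number;
  body: Record<string, unknown>;
}

// In-memory stand-in for a real queue. A production setup needs a durable
// store so jobs survive restarts and are visible to a separate worker.
const jobs = new Map<string, Job>();
let nextJobNumber = 1;

// POST handler logic: record the job and return 202 immediately.
function submitJob(payload: unknown): HttpResult {
  const id = `job_${nextJobNumber++}`;
  jobs.set(id, { id, status: "queued", payload });
  return { status: 202, body: { jobId: id } };
}

// Called by the background worker when the long-running task finishes.
function completeJob(jobId: string, result: unknown): void {
  const job = jobs.get(jobId);
  if (job) {
    job.status = "done";
    job.result = result;
  }
}

// GET handler logic: the client polls this with the job ID.
function pollJob(jobId: string): HttpResult {
  const job = jobs.get(jobId);
  if (!job) return { status: 404, body: { error: "unknown job" } };
  return {
    status: 200,
    body: { jobId: job.id, status: job.status, result: job.result ?? null },
  };
}
```

The request handler never waits on the slow work, so the 60-second limit stops being relevant to the endpoint; only the worker needs a long-lived process.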

Railway: the CPU burst behavior

Railway's pricing model is usage-based: you pay for vCPU-seconds and memory-GB-seconds actually consumed. On the Starter plan, Railway allocates shared CPU with burst capability. "Burst" means your worker gets full CPU when the shared pool has headroom, and gets throttled when it does not.

For a background worker that processes 20 jobs a day, burst is fine — you almost never compete for CPU. For a worker handling sustained load — 200 jobs per hour, a cron that runs heavy computation every 15 minutes — you will see CPU-throttled processing that is 3-5x slower than your staging environment, which runs burst uncontested.

// Diagnosing Railway CPU throttling
// Add timing to your worker to catch the slowdown
const start = process.hrtime.bigint();
const result = await heavyProcessing(job.data);
const durationMs = Number(process.hrtime.bigint() - start) / 1_000_000;

if (durationMs > EXPECTED_MAX_MS * 2) {
  logger.warn('worker.slow', {
    jobId: job.id,
    durationMs,
    expectedMaxMs: EXPECTED_MAX_MS,
    // If consistently slow during peak hours → CPU throttling
    // If randomly slow → likely I/O (database, external API)
  });
}

// Fix for Railway: upgrade to dedicated CPU ($5-10/month extra per service)
// or move to Fly.io Machines (persistent VMs, dedicated compute)
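
The peak-hours versus random distinction in those comments can be checked against the slow-job log entries once enough of them accumulate. A rough sketch follows; the `SlowJob` shape, the 80% peak-share cutoff, and the minimum sample count are all assumptions for illustration, not Railway guidance.

```typescript
interface SlowJob {
  jobId: string;
  hourUtc: number; // hour of day (0-23) when the job ran
  durationMs: number;
}

type Diagnosis = "likely-cpu-throttling" | "likely-io" | "insufficient-data";

// If most slow jobs cluster in known peak hours, contention for shared CPU
// is the more likely cause; if they are spread out, suspect I/O instead.
function diagnoseSlowJobs(
  slowJobs: SlowJob[],
  peakHoursUtc: number[],
  minSamples: number = 10, // assumed: below this, do not guess
  peakShare: number = 0.8, // assumed cutoff for "consistently at peak"
): Diagnosis {
  if (slowJobs.length < minSamples) return "insufficient-data";
  const peak = new Set(peakHoursUtc);
  const inPeak = slowJobs.filter((j) => peak.has(j.hourUtc)).length;
  return inPeak / slowJobs.length >= peakShare
    ? "likely-cpu-throttling"
    : "likely-io";
}
```

This is a heuristic, not proof: an external API that is itself slow at peak hours would produce the same clustering, so it is worth confirming against CPU metrics before paying for an upgrade.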

The migration trigger for Railway is when a dedicated CPU add-on still does not meet your throughput requirements, or when you need persistent disk (Railway has no persistent volumes on the Starter plan, only ephemeral storage). Fly.io handles both.

The Upstash signal: region latency

Upstash Redis is serverless and regionally deployed. Your Vercel functions run at the edge — potentially in a region that is 80ms away from your Upstash instance. For a queue that enqueues once and processes async, this is irrelevant. For a cache that is hit on every request in a hot path, an 80ms round trip to Redis defeats the purpose of the cache.

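// Measure your actual Upstash latency from Vercel edge
import { Redis } from '@upstash/redis';

// Reads UPSTASH_REDIS_REST_URL and UPSTASH_REDIS_REST_TOKEN from the environment
const redis = Redis.fromEnv();

export async function GET(req: Request) {
  const start = Date.now();
  const value = await redis.get('test-key');
  const latencyMs = Date.now() - start;

  // Under 5ms: same region, cache is helping
  // 40-100ms: different region, the cache may be a net latency cost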
  // Over 100ms: explicitly choose region in Upstash dashboard to match Vercel

  return Response.json({ latencyMs, region: process.env.VERCEL_REGION });
}

// Fix: in Upstash dashboard, pin your database to the same AWS region
// as your primary Vercel deployment (usually us-east-1 or eu-west-1)
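
Whether a cache with a slow round trip is still worth having comes down to simple expected-value arithmetic. The sketch below assumes a basic model: every request pays the Redis round trip, and a miss additionally pays the origin query. The function names and the 60ms origin latency in the example are illustrative assumptions.

```typescript
// Expected per-request latency with the cache in front of the origin.
function expectedLatencyWithCacheMs(
  redisRttMs: number,
  hitRate: number, // fraction of requests served from cache, 0..1
  originMs: number, // latency of the uncached query
): number {
  return redisRttMs + (1 - hitRate) * originMs;
}

// The cache helps only if that expectation beats going straight to the origin.
function cacheHelps(redisRttMs: number, hitRate: number, originMs: number): boolean {
  return expectedLatencyWithCacheMs(redisRttMs, hitRate, originMs) < originMs;
}

// Cross-region: an 80ms round trip in front of an assumed 60ms query.
console.log(cacheHelps(80, 0.9, 60));
// Same region: a 3ms round trip in front of the same query.
console.log(cacheHelps(3, 0.9, 60));
```

Under this model, a round trip longer than the origin query can never pay off, no matter how high the hit rate is, which is why the region fix comes before any tuning of cache keys or TTLs.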

Reading the signals together

The important pattern: none of these signals individually means "migrate to AWS." Most of them have a configuration fix that buys another six to twelve months. The migration trigger is when you find yourself spending engineering time on infrastructure configuration instead of product, and the fixes are no longer restoring headroom — just delaying the next incident.

// Signal matrix — what each signal actually means

Supabase connection limit:
  First hit:    Config fix (enable PgBouncer pooler) — 30 minutes
  Second hit:   Upgrade Supabase compute tier ($50-150/month) — 1 day
  Third hit:    Investigate query patterns and connection leaks — 1 week
  Migration:    If still hitting limits with compute upgraded and pooler enabled

Vercel cold start P99 > 1.5s:
  First:        Bundle audit — lazy imports, tree shaking — 2 days
  Second:       Move CPU-heavy paths to Railway/Fly.io workers — 3 days
  Migration:    If business SLA requires sub-200ms P99 for all routes

Railway CPU throttle:
  First:        Upgrade to dedicated CPU ($5-10/month) — 10 minutes
  Migration:    If dedicated CPU is still throttled or persistent disk needed

Monthly platform cost > $600:
  Evaluate:     What would self-managed cost? (include engineering time)
  Rule of thumb: At $600/month, AWS is likely cheaper if you have DevOps bandwidth

The decision to migrate is almost never "we hit a technical limit." It is "we are spending engineering time managing infrastructure that we should be spending on product, and the cost savings of self-managed now justify hiring for or learning DevOps."

The practical trigger most companies hit: $5,000-10,000/month in combined platform costs. At that number, the equivalent AWS setup is $800-1,500/month. The cost difference funds a part-time DevOps contractor to manage it.
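
The cost argument reduces to one subtraction, sketched here with the ranges quoted above. The `opsUsd` figure for a part-time contractor is a placeholder assumption, not a number from this article; substitute your own quote.

```typescript
interface CostScenario {
  platformUsd: number; // current combined platform bill per month
  awsUsd: number; // estimated equivalent AWS bill per month
  opsUsd: number; // assumed monthly cost of whoever runs the AWS setup
}

// Positive result: migrating saves money even after paying for operations.
function monthlyNetSavingsUsd(s: CostScenario): number {
  return s.platformUsd - s.awsUsd - s.opsUsd;
}

// Least favorable end of the quoted ranges: $5,000 platform, $1,500 AWS.
console.log(monthlyNetSavingsUsd({ platformUsd: 5000, awsUsd: 1500, opsUsd: 3000 }));
// Most favorable end: $10,000 platform, $800 AWS.
console.log(monthlyNetSavingsUsd({ platformUsd: 10000, awsUsd: 800, opsUsd: 3000 }));
```

With a hypothetical $3,000/month contractor, the low end of the range roughly breaks even and the high end leaves several thousand dollars a month, which is consistent with the trigger sitting in the $5,000-10,000 band rather than lower.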

What part 3 covers

Part 3 is the actual migration playbook: the order to migrate components in (database last, not first), Fly.io as the intermediate step for workers, the strangler fig pattern for moving traffic off Railway, the real cost comparison at each ARR stage, and the one thing on this stack you should never migrate — Vercel's edge network.