July 4, 2026 · Architecture · 11 min read

From Vercel to AWS, part 3: the migration playbook

A founder asked the wrong question: "Can we stay on Supabase?" That is a technical question. The real question is financial: "What does an engineer-hour spent managing our own Postgres cost, compared with what we are paying Supabase?" At $25/month, it is a clear no-brainer: stay on Supabase. At $500/month, you are still probably better off staying. At $2,000/month, you do the math. At $6,000/month, the AWS equivalent is about $800/month, and the difference funds a part-time DevOps contractor to manage it.

This is part 3 of the Vercel to AWS series. Parts 1 and 2 covered what to build on and when the signals tell you it is time to consider moving. Part 3 is the playbook: the exact order to migrate components, how to use Fly.io as an intermediate step, the strangler fig pattern for background workers, the real cost calculation at different ARR stages, and the one part of this stack you should probably never migrate.

The migration decision is a cost calculation, not a technical threshold

Most infrastructure migrations are triggered by hitting a hard limit. A table hits the row count ceiling. A queue cannot process fast enough. The database runs out of disk. These triggers feel clean but they are rare. The actual migration trigger for startups on modern dev platforms is a cost-per-capability comparison that only makes sense at a specific ARR level.

// Cost comparison at different ARR stages
// Assumes: Next.js app, PostgreSQL, 1-2 background workers, moderate traffic

ARR: $50k (~200 users)
  Dev Platform (Vercel + Supabase + Railway): $95-120/month
  Self-managed AWS minimum (RDS db.t4g.small, ECS, ALB, NAT Gateway): $280/month + setup time
  Verdict: Stay on dev platforms — AWS is MORE expensive and slower to ship

ARR: $300k (~1,000 users)
  Dev Platform: $400-600/month
  Self-managed AWS: $400-600/month (breaks even)
  But: AWS management now requires 2-4 hrs/week of engineering time
  Verdict: Still favor dev platforms unless you have DevOps bandwidth

ARR: $800k (~3,000 users)
  Dev Platform: $1,500-3,000/month
  Self-managed AWS: $600-900/month
  Savings: $10k-25k/year
  Verdict: AWS makes sense if you have (or can hire for) DevOps

ARR: $2M+ (~8,000 users)
  Dev Platform: $5,000-10,000/month
  Self-managed AWS: $1,500-2,500/month
  Savings: $40k-90k/year → funds a part-time DevOps contractor
  Verdict: Migrate

One cost that does not appear on the AWS invoice: engineering time. A Supabase Pro database is zero maintenance. An RDS instance running PostgreSQL requires someone who knows how to configure maintenance windows, set up automated backups, understand parameter groups, tune connection pooling at the infrastructure level, and respond when a snapshot restore is needed at 2 AM. That is worth real money. Include it in your calculation.
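
To fold engineering time into the comparison, here is a minimal sketch. The platform and AWS figures are midpoints of the $800k ARR row above; the hourly rate and the 3 hours/week of upkeep are illustrative assumptions, not measurements (the table cites 2-4 hrs/week at the $300k stage).

```python
from dataclasses import dataclass

WEEKS_PER_MONTH = 52 / 12  # average number of weeks in a month


@dataclass
class Scenario:
    platform_monthly: float    # current dev-platform bill, USD
    aws_monthly: float         # estimated AWS bill, USD
    ops_hours_per_week: float  # assumed engineering time spent on AWS upkeep
    hourly_rate: float         # assumed loaded cost of one engineer hour, USD


def net_monthly_savings(s: Scenario) -> float:
    """Invoice savings minus the cost of the engineering time AWS consumes."""
    ops_cost = s.ops_hours_per_week * WEEKS_PER_MONTH * s.hourly_rate
    return s.platform_monthly - s.aws_monthly - ops_cost


# Midpoints of the $800k ARR row; hours and rate are hypothetical.
scenario = Scenario(platform_monthly=2250, aws_monthly=750,
                    ops_hours_per_week=3, hourly_rate=100)
print(f"Net savings: ${net_monthly_savings(scenario):,.0f}/month")
```

With these assumed inputs, the $1,500/month invoice gap shrinks to roughly $200/month once upkeep is priced in, which is why the verdict at this stage hinges on DevOps bandwidth rather than on the invoice alone.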

The migration order: workers first, database last

Most teams migrate in the wrong order: they start with the database because it feels like the foundation, then discover the migration is a 3-month project that risks data loss, and either abandon the effort or rush a risky cutover. The correct order is almost the opposite.

The migration sequence that works:

1. Background workers: Railway → Fly.io. Start here. Workers have the lowest risk profile of anything in your stack: they are stateless, they connect to your existing Supabase database and Upstash queue over the public internet, and a failed worker rollout does not affect your user-facing application at all. If Fly.io is misconfigured, jobs back up in the queue and you roll back. No data is at risk.

# fly.toml — first migration target
app = "yourapp-worker"
primary_region = "iad"  # us-east (match your Supabase region)

[build]
  dockerfile = "workers/Dockerfile"

[env]
  NODE_ENV = "production"
  WORKER_CONCURRENCY = "5"

# No [[services]] or [http_service] section: a pure worker accepts no
# inbound HTTP, it pulls jobs from Upstash

[mounts]
  source = "worker_data"
  destination = "/data"  # persistent disk — Railway's missing feature

# Scaling: fly.toml has no [[autoscaling]] section. Set the machine count
# with `fly scale count` (each machine mounts its own volume). Scaling
# between 1 and 5 machines on queue depth needs a separate autoscaler.

Fly.io machines are persistent VMs, not serverless functions. They start in ~500ms, can run on dedicated CPU, and support persistent disk. For a background worker that needs consistent CPU and state between job runs, this is the right primitive. The shared-CPU bursting on Railway that caused the 3-5x slowdowns in Part 2 does not occur on Fly.io's dedicated-CPU machines.

Run Railway and Fly.io in parallel for two weeks. Both workers pull from the same Upstash queue. If Fly.io behaves correctly, disable the Railway worker. Your total migration cost: one afternoon to write the fly.toml and Dockerfile, two weeks of monitoring.
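
One way to decide when the parallel run has passed is to compare outcomes per platform. The sketch below assumes each worker writes a job record tagged with the platform it ran on; the record shape, the platform labels, and the thresholds are all hypothetical choices for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from statistics import median


@dataclass
class PlatformStats:
    durations_ms: list = field(default_factory=list)  # successful jobs only
    failures: int = 0

    @property
    def total(self) -> int:
        return len(self.durations_ms) + self.failures

    @property
    def failure_rate(self) -> float:
        return self.failures / self.total if self.total else 0.0


def summarize(job_log):
    """Group job outcomes by the platform that processed them."""
    stats = defaultdict(PlatformStats)
    for entry in job_log:
        s = stats[entry["platform"]]
        if entry["ok"]:
            s.durations_ms.append(entry["duration_ms"])
        else:
            s.failures += 1
    return stats


def safe_to_cut_over(stats, candidate="fly", baseline="railway",
                     max_failure_rate=0.01, min_jobs=3):
    """Candidate needs enough samples, few failures, and a median
    duration no worse than the baseline's."""
    c = stats.get(candidate, PlatformStats())
    b = stats.get(baseline, PlatformStats())
    if c.total < min_jobs or not c.durations_ms or not b.durations_ms:
        return False
    return (c.failure_rate <= max_failure_rate
            and median(c.durations_ms) <= median(b.durations_ms))


# Hypothetical records collected during the parallel run.
job_log = [
    {"platform": "railway", "ok": True, "duration_ms": 900},
    {"platform": "railway", "ok": True, "duration_ms": 2700},
    {"platform": "railway", "ok": True, "duration_ms": 1100},
    {"platform": "fly", "ok": True, "duration_ms": 800},
    {"platform": "fly", "ok": True, "duration_ms": 850},
    {"platform": "fly", "ok": True, "duration_ms": 900},
]
print(safe_to_cut_over(summarize(job_log)))
```

In a real run you would set `min_jobs` far higher than 3 and probably compare a tail percentile as well as the median; the point is only that the cutover decision comes from recorded data rather than from a feeling that things look fine.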

2. API layer: Vercel stays (for now, and possibly forever). This is the counterintuitive part of the playbook. Vercel's edge infrastructure (global CDN, serverless function deployment, zero-config SSL, automatic preview environments, ISR caching) is genuinely hard to replicate. Replacing it with CloudFront + API Gateway + Lambda requires significant infrastructure-as-code effort and produces a worse developer experience.

The math on migrating off Vercel almost never pencils out at startup scale. Compare Vercel Pro at $20/month (team pricing is higher, but the conclusion holds) with the complexity of CloudFront + WAF + API Gateway + Lambda + S3 + ACM: the AWS equivalent is $50-200/month plus 3-5 weeks of setup. Keep Vercel. If you eventually reach Vercel Enterprise pricing territory ($400+/month per team), revisit.

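3. Database: Supabase → RDS. Do this last, not first. The database migration is the highest risk and requires the most preparation. The pattern below combines two mechanisms, a one-time dump and restore plus logical replication, so that RDS runs in parallel before cutover: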

// Database migration: the shadow migration pattern
// Run your new RDS instance in parallel for 2-4 weeks before cutover

Step 1: Provision RDS (Terraform)
# rds.tf
resource "aws_db_instance" "main" {
  engine                  = "postgres"
  # Match the Postgres major version Supabase is running (check with SELECT version();)
  engine_version          = "16.3"
  instance_class          = "db.t4g.medium"
  allocated_storage       = 100           # GiB
  storage_encrypted       = true
  db_name                 = "yourapp_prod"
  username                = "app"
  password                = var.db_password
  multi_az                = true          # standby in a second AZ for failover
  backup_retention_period = 7             # days
  deletion_protection     = true

  # Default parameter group for the chosen major version
  parameter_group_name    = "default.postgres16"
}

Step 2: Copy existing data (pg_dump + pg_restore)
pg_dump -Fc --no-acl --no-owner \
  "postgresql://postgres.abc:pw@aws-0-us-east-1.pooler.supabase.com:5432/postgres" \
  > /tmp/supabase-dump.dump

pg_restore \
  -d "postgresql://app:pw@rds-host.amazonaws.com:5432/yourapp_prod" \
  /tmp/supabase-dump.dump

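Step 3: Enable logical replication to keep RDS in sync
-- On Supabase (enable in dashboard: Database → Replication)
-- Supabase Pro supports wal_level = logical
-- Use AWS DMS or pglogical to stream changes
-- Start change capture from a point at or before the Step 2 dump,
-- otherwise writes made between the dump and replication start are lost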

Step 4: Shadow test for 2 weeks
# Route 10% of read queries to RDS, write to Supabase
# Compare query results. Fix any discrepancies.

Step 5: Cutover
# Pause writes and wait for replication lag to reach zero
# Update DATABASE_URL in all services to RDS
# Monitor for 24h
# Disable the logical replication source on Supabase
# Downgrade Supabase (keep for 30 days as backup, then cancel)
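
The shadow test in Step 4 can be sketched as a thin wrapper around your read path. This is a variation on the step as written: instead of serving 10% of reads from RDS, it always answers from the primary (Supabase) and replays a sample of reads against the shadow (RDS) to compare results, so a discrepancy never reaches a user. The `ShadowReader` class and the query callables are hypothetical stand-ins for whatever database client you use.

```python
import random
from typing import Any, Callable, Optional

Query = Callable[[str], Any]  # takes SQL text, returns rows


class ShadowReader:
    """Serve every read from the primary; replay a sample on the shadow
    database and record any statements whose results differ."""

    def __init__(self, primary: Query, shadow: Query,
                 sample_rate: float = 0.10,
                 rng: Optional[random.Random] = None):
        self.primary = primary
        self.shadow = shadow
        self.sample_rate = sample_rate
        self.rng = rng or random.Random()
        self.compared = 0
        self.mismatches = []

    def read(self, sql: str) -> Any:
        result = self.primary(sql)  # users always get the primary's answer
        if self.rng.random() < self.sample_rate:
            self.compared += 1
            try:
                if self.shadow(sql) != result:
                    self.mismatches.append(sql)
            except Exception:
                # A shadow failure must never break the user-facing read.
                self.mismatches.append(sql)
        return result


# Stub "databases": dict lookups standing in for real query functions.
primary_rows = {"SELECT count(*) FROM users": [(200,)]}
shadow_rows = {"SELECT count(*) FROM users": [(199,)]}  # e.g. replication lag

reader = ShadowReader(primary_rows.get, shadow_rows.get, sample_rate=1.0)
rows = reader.read("SELECT count(*) FROM users")
print(rows, reader.mismatches)
```

Comparing results with `!=` assumes both sides return rows in the same order; queries without an ORDER BY may need sorting before comparison. Mismatches caused by replication lag should shrink to zero as the shadow catches up; ones that persist are the discrepancies worth fixing before cutover.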

Fly.io as the intermediate step: why it matters

Many migration guides jump straight from Railway to ECS. Fly.io as an intermediate step exists for a reason: it gives you persistent VMs, private networking between services, and predictable CPU without requiring you to learn IAM roles, ECS task definitions, target groups, and security groups all at once.

The Fly.io to ECS migration, when you eventually need it, is mechanical:

// Fly.io worker → ECS Fargate: direct translation
// fly.toml                          → ECS task definition
// primary_region = "iad"            → aws_region = "us-east-1"
// [mounts] source = "worker_data"   → EFS volume mount
// machine count (min 1, max 5)      → desired_count + auto-scaling policy
// fly secrets set KEY=value         → AWS Secrets Manager, injected as ECS env vars
// fly deploy                        → docker push to ECR + aws ecs update-service

# The Docker image is identical. The runtime environment changes.
# Expected migration time: 1-2 days per worker service
You need ECS when you need fine-grained VPC networking (workers reaching RDS over a private address instead of a public endpoint, for example for compliance reasons) or GPU instances for model inference (on ECS, GPUs require the EC2 launch type rather than Fargate). For most startups without those requirements, Fly.io's private networking between services is enough, without the ECS complexity.

The connection between this migration and your LLM infrastructure

The agents-in-production series covered the architecture of LLM workloads at three scales. At each scale, the infrastructure host matters.

At startup scale (Part 1 of that series): the LLM call queue and worker live on Railway or Fly.io. The architecture applies equally to both. The 60-second Vercel function limit makes the async worker pattern non-optional, and Part 1's Railway setup gives you that.

At scale-up (Part 2): the StepRunner with Redis checkpoints runs on a Railway or Fly.io worker. Upstash Redis handles the checkpoint storage. When you hit Railway's CPU throttling during heavy multi-step pipeline runs, you are now looking at the same Fly.io migration described here.

At enterprise scale (Part 3): the LLM gateway is a service in its own right, with per-team rate limits, attribution headers, and cost dashboards. That gateway does not belong on Fly.io. It belongs on ECS, in the same VPC as your other services, with auto-scaling and internal-only routing. The migration from Fly.io to ECS happens when the gateway transitions from "a service one team runs" to "platform infrastructure the whole company depends on."

What to never migrate

One answer to this is obvious by now: Vercel's CDN and edge infrastructure. The others are worth naming explicitly.

Resend. Building a reliable email delivery system — IP warm-up, bounce handling, SPF/DKIM/DMARC configuration, deliverability monitoring, feedback loop processing — is a full-time job. Resend at $20-90/month is a bargain at any ARR level where email matters. Migrate to SES or Postmark only if you need volume pricing at millions of emails per month.

Upstash for queues and caching. The serverless Redis model is correct for distributed applications with variable load. ElastiCache is $50-200+/month for an always-on instance that you provision regardless of load. For a job queue that processes nothing overnight, Upstash at $1-10/month is a better fit. The migration to ElastiCache makes sense when Redis is a hot path for synchronous user requests at scale, not for async job queues.

Clerk (if you use it). Auth is one of the few infrastructure categories where self-managed is genuinely worse across the board: security, maintenance burden, feature velocity. Cognito and Auth0 are the managed alternatives if you need one. Clerk stays unless you have a specific reason to move.

The checklist before you migrate

// Pre-migration checklist — if any item is no, stop until it is yes

[ ] Monthly platform cost > $3,000 OR a specific technical limit cannot be fixed through configuration
[ ] You have (or can hire) someone who can manage Postgres at the infrastructure level
[ ] You have working Terraform or CloudFormation for your target AWS resources
[ ] You have a rollback plan: dev platform stays live until parallel-run passes
[ ] Database schema version is clean — no pending migrations, no orphaned tables
[ ] Application secrets are in environment variables, not hardcoded
[ ] You have observability: logs, metrics, alerting before migration (not after)
[ ] You have documented your current architecture (worker count, queue names, DB tables)
[ ] Your engineers have AWS IAM and VPC basics — not expert, just basics
[ ] You have 6 weeks for the migration, not 2 (database migrations take longer than planned)

The most common migration failure: a team migrates because they want to, not because the cost calculation supports it. They spend six weeks moving off Supabase at $150/month, land on RDS at $180/month after accounting for the NAT Gateway and backup storage, and realize they saved nothing and added operational burden.
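
A quick payback check makes this failure mode visible before the work starts. The sketch below is a simplification that ignores ongoing upkeep time; the $30,000 migration cost (roughly six engineer-weeks) is a hypothetical figure, and the monthly bills are the ones quoted in this article.

```python
from typing import Optional


def payback_months(platform_monthly: float, aws_monthly: float,
                   migration_cost: float) -> Optional[float]:
    """Months until the one-time migration cost is recovered.

    Returns None when the move never pays for itself.
    """
    monthly_savings = platform_monthly - aws_monthly
    if monthly_savings <= 0:
        return None
    return migration_cost / monthly_savings


# The failure case above: $150/month before, $180/month after.
print(payback_months(150, 180, migration_cost=30_000))

# The $6,000 vs $800 case from the introduction.
print(payback_months(6_000, 800, migration_cost=30_000))
```

The first call returns `None` because there are no savings to recover the cost from; the second returns a finite number of months, which is the figure to weigh against your runway and roadmap.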

Migration is correct when the cost savings are real, the team has the DevOps capability, and the technical limits have genuinely been exhausted. In that order.