FastAPI 104: Rate Limiting & Throttling — Token Buckets, Sliding Windows & Redis
FastAPI · Production


April 1, 2026 · 12 min read · PART 04 / 18

In Part 3 we covered API design — cursor pagination, Pydantic filtering, and consistent error shapes. Now we're adding the defensive layer that keeps your API alive when clients misbehave: rate limiting. A bug loop, a bot scraper, or a viral launch can all saturate your connection pool within seconds. The question isn't whether to rate limit — it's which algorithm, where to enforce it, and whether your limits actually hold across workers. This is Part 4.

Why rate limiting is not optional

Your FastAPI app talks to Postgres. Postgres has a connection pool — let's say 20 connections. One client sends 500 requests in a second. Your pool exhausts. Every other user gets a timeout. You've taken down your service for everyone because of one bad actor (or one developer who forgot to add a delay in a test script).

Rate limiting protects three things:

  • Infrastructure — DB connections, memory, CPU
  • Fairness — one client can't crowd out others
  • Security — brute force protection on auth endpoints, scraping prevention on data endpoints

At Staff level, you're expected to know the trade-offs between algorithms, the failure modes of naive implementations, and how to communicate limits clearly to API consumers.

Algorithm 1: token bucket

Each client gets a virtual bucket that holds N tokens. Every request consumes one token. Tokens refill at a fixed rate R per second (not all at once — steadily). If the bucket is empty, the request is rejected with a 429. The bucket never exceeds capacity.

Bucket capacity: 10 tokens | Refill rate: 2 tokens/sec

t=0s:  bucket=10
       → 10 requests burst through      ✅  bucket=0
t=1s:  bucket=2  (refilled 2 tokens)
       → 2 requests through             ✅  bucket=0
       → 1 more request                 ❌  429
t=2s:  bucket=2 again…

Token bucket is burst-friendly. A client that's been idle can use their accumulated tokens all at once. This is the right model for upload endpoints, search, or any endpoint where occasional bursts are expected. The sustained rate (refill rate) is what's actually enforced long-term.

Algorithm 2: sliding window counter

Tracks the count of requests in a rolling window (e.g. the last 60 seconds). More precise than token bucket and closes the "boundary attack" that breaks fixed windows.

The boundary attack on fixed windows:

Window: 60 seconds | Limit: 100 req/window

Fixed window resets at the top of every minute.

A client sends:
  99 requests at t=0:59  ✅  (within first window)
  99 requests at t=1:01  ✅  (within second window)
  → 198 requests in 2 seconds.

Your limit is 100/min, but they got 198/2sec.
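The attack is easy to reproduce against a naive counter. A toy fixed-window implementation (a simulation with an explicit clock, not production code):

```python
def fixed_window_allow(counters: dict, client: str, t: float,
                       limit: int = 100, window: int = 60) -> bool:
    """Naive fixed-window counter: buckets reset at every window boundary."""
    bucket = int(t // window)  # t=59 -> bucket 0, t=61 -> bucket 1
    key = (client, bucket)
    counters[key] = counters.get(key, 0) + 1
    return counters[key] <= limit

# Replay the boundary attack: 99 requests at t=0:59, 99 more at t=1:01.
counters = {}
first = sum(fixed_window_allow(counters, "bot", 59.0) for _ in range(99))
second = sum(fixed_window_allow(counters, "bot", 61.0) for _ in range(99))
# first + second == 198: all 198 requests were allowed in a 2-second span,
# because the second batch landed in a fresh bucket.
```

The two batches straddle the reset, land in different buckets, and each batch sees a nearly empty counter.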

Sliding window solves this by always measuring from "now minus window size":

At t=1:01, the window looks back to t=0:01.
The 99 requests at t=0:59 are still in-window.
Combined with the 99 new ones = 198 → exceeds limit.
1:01 requests are rejected. ✅

Implementation uses a Redis sorted set where the score is the Unix timestamp of each request. To count the window: remove entries older than (now - window), then count what's left.

Implementation: sliding window in FastAPI middleware

import time
import uuid
from contextlib import asynccontextmanager
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
import redis.asyncio as aioredis

redis_client: aioredis.Redis | None = None

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Lifespan replaces the deprecated @app.on_event("startup") hook
    global redis_client
    redis_client = aioredis.from_url("redis://localhost:6379")
    yield
    await redis_client.aclose()

app = FastAPI(lifespan=lifespan)

async def check_rate_limit(
    r: aioredis.Redis,
    client_id: str,
    limit: int = 100,
    window: int = 60,
) -> tuple[bool, int]:
    key = f"ratelimit:{client_id}"
    now = time.time()

    async with r.pipeline() as pipe:
        # Remove entries outside the window
        pipe.zremrangebyscore(key, 0, now - window)
        # Record this request (score = timestamp, member = unique str)
        pipe.zadd(key, {f"{now}-{uuid.uuid4().hex}": now})
        # Count requests in the window
        pipe.zcard(key)
        # Auto-expire the key after window duration (cleanup)
        pipe.expire(key, window)
        results = await pipe.execute()

    count = results[2]
    remaining = max(0, limit - count)
    return count <= limit, remaining

@app.middleware("http")
async def rate_limit_middleware(request: Request, call_next):
    # Use authenticated user ID if available, fall back to IP
    client_id = getattr(request.state, "user_id", None) or request.client.host
    allowed, remaining = await check_rate_limit(redis_client, client_id)

    if not allowed:
        return JSONResponse(
            {"error": "rate_limit_exceeded", "message": "Too many requests"},
            status_code=429,
            headers={
                "X-RateLimit-Limit": "100",
                "X-RateLimit-Remaining": "0",
                "Retry-After": "60",
            }
        )

    response = await call_next(request)
    response.headers["X-RateLimit-Limit"] = "100"
    response.headers["X-RateLimit-Remaining"] = str(remaining)
    return response

One critical detail: the zadd member must be unique per request. If two requests produce the same member string, Redis deduplicates them (same member = update, not insert) and one request goes uncounted. Appending a random component such as uuid.uuid4().hex to the timestamp makes collisions practically impossible. Note that id(object()) is not a safe substitute: CPython frees the temporary object immediately and can hand the same id to the very next call.

Where to enforce: gateway vs middleware vs app code

You have three places to put rate limiting, each with different trade-offs:

Client Request
     │
     ▼
┌─────────────────────┐
│  API Gateway/nginx  │  ← cheapest, before your app sees it
│  (AWS ALB, Kong)    │    can't see auth context easily
└─────────────────────┘
     │
     ▼
┌─────────────────────┐
│ FastAPI Middleware  │  ← per-user, per-plan, full request context
│ (our example above) │    requires Redis for multi-worker consistency
└─────────────────────┘
     │
     ▼
┌─────────────────────┐
│  Application Code   │  ← most flexible (per-endpoint, business rules)
│  (route handler)    │    easy to miss endpoints, most expensive
└─────────────────────┘

The right answer in most production systems is layered: a global limit at the gateway (protect against total volume), and per-user/per-plan limits in middleware (fairness and business rules).

Per-user vs per-plan: the SaaS pattern

In a SaaS API, limits aren't just about protection — they're a product feature. Free plan users get 1,000 requests/day; Pro users get an effectively unlimited quota. Here's how the request lifecycle looks:

async def get_rate_limit_for_user(user_id: str, db) -> int:
    # Cache plan limits in Redis to avoid DB on every request
    cached = await redis_client.get(f"plan:{user_id}")
    if cached:
        return int(cached)

    user = await db.get_user(user_id)
    limit = 1_000 if user.plan == "free" else 1_000_000
    await redis_client.setex(f"plan:{user_id}", 300, limit)  # 5min cache
    return limit

@app.middleware("http")
async def tiered_rate_limit(request: Request, call_next):
    user_id = getattr(request.state, "user_id", None)
    if not user_id:
        return await call_next(request)  # Unauthed requests handled by auth middleware

    limit = await get_rate_limit_for_user(user_id, request.state.db)
    allowed, remaining = await check_rate_limit(
        redis_client, user_id, limit=limit, window=86400  # 24h window
    )
    if not allowed:
        return JSONResponse({"error": "daily_limit_exceeded"}, status_code=429)

    return await call_next(request)

Notice that this middleware runs after auth middleware — request.state.user_id is set by auth. Rate limiting before auth means you're limited by IP, which bots trivially rotate around.

Response headers: the contract with your clients

Good rate limit headers let clients implement adaptive throttling — slowing down before hitting the limit instead of hammering until they get 429s. These are the standard headers:

HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 43
X-RateLimit-Reset: 1743523200

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1743523200
Retry-After: 17

Retry-After is the critical one on 429. Without it, every client's retry loop fires immediately. You've rejected their request, they retry instantly, you reject again — a feedback loop that makes your server load worse. With Retry-After: 17, compliant clients wait 17 seconds before retrying.
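A client-side sketch of that adaptive behavior, assuming the header names shown above. The pacing heuristic — spreading the remaining budget over the time left in the window — is one reasonable choice, not a standard:

```python
def throttle_delay(headers: dict[str, str], now: float) -> float:
    """Seconds a client should wait before its next request.
    0.0 means go immediately."""
    # A 429's Retry-After is authoritative: wait exactly that long.
    # (Assumes the delta-seconds form; Retry-After may also be an HTTP-date.)
    if "Retry-After" in headers:
        return float(headers["Retry-After"])

    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    reset_at = float(headers.get("X-RateLimit-Reset", now))
    if remaining <= 0:
        return max(0.0, reset_at - now)  # quota exhausted: wait for the reset
    # Proactive pacing: spread the remaining budget over the time left
    return max(0.0, (reset_at - now) / remaining)
```

With 30 requests left and 60 seconds until reset, the client paces itself to one request every 2 seconds instead of burning the budget and slamming into a 429.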

Common mistakes summary

  • In-process counters. Each Uvicorn worker has its own memory. Four workers means four independent counters — your effective limit is 4× what you intended. Always use Redis (or another external store) as the single source of truth.
  • Rate limiting before authentication. You're keying on IP address, which a bot rotates every few requests. Auth first, then rate limit by user ID — the identity that actually matters.
  • No Retry-After on 429. Clients retry immediately. You've turned a rate-limit event into a request storm. The header costs you nothing to add.
  • Fixed window only. Boundary attacks let adversaries get 2× quota in a 2-second span across a window reset. Use sliding windows for any security-sensitive endpoint.
  • Same limit on every endpoint. /api/login and /api/products have completely different risk profiles. Login needs 5–10 attempts/minute max (brute force); a product list endpoint can safely absorb much higher rates.
  • No X-RateLimit-Remaining. Well-behaved API clients (SDKs, mobile apps) use this to proactively slow down before hitting zero. Without it, they can't implement graceful backoff.
  • Not making the Redis pipeline atomic. The pipeline in the example above sends all four commands in one round-trip, and redis-py pipelines wrap them in MULTI/EXEC by default, so no other client's writes interleave. Issue the commands individually and another worker can write between your zremrangebyscore and zcard, skewing counts under concurrent load.
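On the per-endpoint point above: one lightweight pattern is a longest-prefix policy table the middleware consults before calling the limiter. The routes and numbers here are illustrative, not prescriptive:

```python
# Hypothetical per-endpoint policy: path prefix -> (limit, window_seconds).
# Security-sensitive routes get tight limits; cheap reads get generous ones.
ENDPOINT_LIMITS = {
    "/api/login":    (5, 60),    # brute-force protection
    "/api/search":   (30, 60),   # expensive queries
    "/api/products": (300, 60),  # cheap, cacheable reads
}
DEFAULT_LIMIT = (100, 60)

def limit_for_path(path: str) -> tuple[int, int]:
    """Longest-prefix match, so /api/login/reset inherits /api/login's limit."""
    best = ""
    for prefix in ENDPOINT_LIMITS:
        if path.startswith(prefix) and len(prefix) > len(best):
            best = prefix
    return ENDPOINT_LIMITS.get(best, DEFAULT_LIMIT)
```

The returned (limit, window) pair plugs straight into check_rate_limit — just include the matched prefix in the Redis key so each endpoint class gets its own bucket rather than sharing one counter.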

Part 4 done. Next up — Part 5: Caching Strategies & Redis Patterns. Cache-aside vs write-through vs write-behind, cache invalidation strategies, thundering herd, and when caching makes consistency problems worse.
