FastAPI 104: Rate Limiting & Throttling — Token Buckets, Sliding Windows & Redis
FastAPI · Production


April 1, 2026 · 12 min read · PART 04 / 18

In Part 3 we covered API design — cursor pagination, Pydantic filtering, and consistent error shapes. Now we're adding the defensive layer that keeps your API alive when clients misbehave: rate limiting. A bug loop, a bot scraper, or a viral launch can all saturate your connection pool within seconds. The question isn't whether to rate limit — it's which algorithm, where to enforce it, and whether your limits actually hold across workers. This is Part 4.

Why rate limiting is not optional

Your FastAPI app talks to Postgres. Postgres has a connection pool — let's say 20 connections. One client sends 500 requests in a second. Your pool exhausts. Every other user gets a timeout. You've taken down your service for everyone because of one bad actor (or one developer who forgot to add a delay in a test script).

Rate limiting protects three things:

  • Infrastructure — DB connections, memory, CPU
  • Fairness — one client can't crowd out others
  • Security — brute force protection on auth endpoints, scraping prevention on data endpoints

At Staff level, you're expected to know the trade-offs between algorithms, the failure modes of naive implementations, and how to communicate limits clearly to API consumers.

Algorithm 1: token bucket

Each client gets a virtual bucket that holds N tokens. Every request consumes one token. Tokens refill at a fixed rate R per second (not all at once — steadily). If the bucket is empty, the request is rejected with a 429. The bucket never exceeds capacity.

Bucket capacity: 10 tokens | Refill rate: 2 tokens/sec

t=0s:  bucket=10
       → 10 requests burst through      ✅  bucket=0
t=1s:  bucket=2  (refilled 2 tokens)
       → 2 requests through             ✅  bucket=0
       → 1 more request                 ❌  429
t=2s:  bucket=2 again…

Token bucket is burst-friendly. A client that's been idle can use their accumulated tokens all at once. This is the right model for upload endpoints, search, or any endpoint where occasional bursts are expected. The sustained rate (refill rate) is what's actually enforced long-term.

Algorithm 2: sliding window counter

Tracks the count of requests in a rolling window (e.g. the last 60 seconds). More precise than token bucket and closes the "boundary attack" that breaks fixed windows.

The boundary attack on fixed windows:

Window: 60 seconds | Limit: 100 req/window

Fixed window resets at the top of every minute.

A client sends:
  99 requests at t=0:59  ✅  (within first window)
  99 requests at t=1:01  ✅  (within second window)
  → 198 requests in 2 seconds.

Your limit is 100/min, but they got 198/2sec.
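The attack is easy to reproduce against a naive counter. A toy fixed-window implementation (a simulation with an explicit clock, not production code):

```python
def fixed_window_allow(counters: dict, client: str, t: float,
                       limit: int = 100, window: int = 60) -> bool:
    """Naive fixed-window counter: buckets reset at every window boundary."""
    bucket = int(t // window)  # t=59 -> bucket 0, t=61 -> bucket 1
    key = (client, bucket)
    counters[key] = counters.get(key, 0) + 1
    return counters[key] <= limit

# Replay the boundary attack: 99 requests at t=0:59, 99 more at t=1:01.
counters = {}
first = sum(fixed_window_allow(counters, "bot", 59.0) for _ in range(99))
second = sum(fixed_window_allow(counters, "bot", 61.0) for _ in range(99))
# first + second == 198: all 198 requests were allowed in a 2-second span,
# because the second batch landed in a fresh bucket.
```

The two batches straddle the reset, land in different buckets, and each batch sees a nearly empty counter.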

Sliding window solves this by always measuring from "now minus window size":

At t=1:01, the window looks back to t=0:01.
The 99 requests at t=0:59 are still in-window.
Combined with the 99 new ones = 198 → exceeds limit.
1:01 requests are rejected. ✅

Implementation uses a Redis sorted set where the score is the Unix timestamp of each request. To count the window: remove entries older than (now - window), then count what's left.

Implementation: sliding window in FastAPI middleware

import time
import uuid
from contextlib import asynccontextmanager
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
import redis.asyncio as aioredis

redis_client: aioredis.Redis | None = None

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Lifespan replaces the deprecated @app.on_event("startup") hook
    global redis_client
    redis_client = aioredis.from_url("redis://localhost:6379")
    yield
    await redis_client.aclose()

app = FastAPI(lifespan=lifespan)

async def check_rate_limit(
    r: aioredis.Redis,
    client_id: str,
    limit: int = 100,
    window: int = 60,
) -> tuple[bool, int]:
    key = f"ratelimit:{client_id}"
    now = time.time()

    async with r.pipeline() as pipe:
        # Remove entries outside the window
        pipe.zremrangebyscore(key, 0, now - window)
        # Record this request (score = timestamp, member = unique str)
        pipe.zadd(key, {f"{now}-{uuid.uuid4().hex}": now})
        # Count requests in the window
        pipe.zcard(key)
        # Auto-expire the key after window duration (cleanup)
        pipe.expire(key, window)
        results = await pipe.execute()

    count = results[2]
    remaining = max(0, limit - count)
    return count <= limit, remaining

@app.middleware("http")
async def rate_limit_middleware(request: Request, call_next):
    # Use authenticated user ID if available, fall back to IP
    client_id = getattr(request.state, "user_id", None) or request.client.host
    allowed, remaining = await check_rate_limit(redis_client, client_id)

    if not allowed:
        return JSONResponse(
            {"error": "rate_limit_exceeded", "message": "Too many requests"},
            status_code=429,
            headers={
                "X-RateLimit-Limit": "100",
                "X-RateLimit-Remaining": "0",
                "Retry-After": "60",
            }
        )

    response = await call_next(request)
    response.headers["X-RateLimit-Limit"] = "100"
    response.headers["X-RateLimit-Remaining"] = str(remaining)
    return response

One critical detail: the zadd member must be unique per request. If two requests produce the same member string, Redis deduplicates them (same member = update, not insert) and one request goes uncounted. Appending a random component such as uuid.uuid4().hex to the timestamp makes collisions practically impossible. Note that id(object()) is not a safe substitute: CPython frees the temporary object immediately and can hand the same id to the very next call.

Where to enforce: gateway vs middleware vs app code

You have three places to put rate limiting, each with different trade-offs:

Client Request
     │
     ▼
┌─────────────────────┐
│  API Gateway/nginx  │  ← cheapest, before your app sees it
│  (AWS ALB, Kong)    │    can't see auth context easily
└─────────────────────┘
     │
     ▼
┌─────────────────────┐
│ FastAPI Middleware  │  ← per-user, per-plan, full request context
│ (our example above) │    requires Redis for multi-worker consistency
└─────────────────────┘
     │
     ▼
┌─────────────────────┐
│  Application Code   │  ← most flexible (per-endpoint, business rules)
│  (route handler)    │    easy to miss endpoints, most expensive
└─────────────────────┘

The right answer in most production systems is layered: a global limit at the gateway (protect against total volume), and per-user/per-plan limits in middleware (fairness and business rules).

Per-user vs per-plan: the SaaS pattern

In a SaaS API, limits aren't just about protection — they're a product feature. Free plan users get 1,000 requests/day; Pro users get an effectively unlimited quota. Here's how the request lifecycle looks:

async def get_rate_limit_for_user(user_id: str, db) -> int:
    # Cache plan limits in Redis to avoid DB on every request
    cached = await redis_client.get(f"plan:{user_id}")
    if cached:
        return int(cached)

    user = await db.get_user(user_id)
    limit = 1_000 if user.plan == "free" else 1_000_000
    await redis_client.setex(f"plan:{user_id}", 300, limit)  # 5min cache
    return limit

@app.middleware("http")
async def tiered_rate_limit(request: Request, call_next):
    user_id = getattr(request.state, "user_id", None)
    if not user_id:
        return await call_next(request)  # Unauthed requests handled by auth middleware

    limit = await get_rate_limit_for_user(user_id, request.state.db)
    allowed, remaining = await check_rate_limit(
        redis_client, user_id, limit=limit, window=86400  # 24h window
    )
    if not allowed:
        return JSONResponse({"error": "daily_limit_exceeded"}, status_code=429)

    return await call_next(request)

Notice that this middleware runs after auth middleware — request.state.user_id is set by auth. Rate limiting before auth means you're limited by IP, which bots trivially rotate around.

Response headers: the contract with your clients

Good rate limit headers let clients implement adaptive throttling — slowing down before hitting the limit instead of hammering until they get 429s. These are the standard headers:

HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 43
X-RateLimit-Reset: 1743523200

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1743523200
Retry-After: 17

Retry-After is the critical one on 429. Without it, every client's retry loop fires immediately. You've rejected their request, they retry instantly, you reject again — a feedback loop that makes your server load worse. With Retry-After: 17, compliant clients wait 17 seconds before retrying.
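A client-side sketch of that adaptive behavior, assuming the header names shown above. The pacing heuristic — spreading the remaining budget over the time left in the window — is one reasonable choice, not a standard:

```python
def throttle_delay(headers: dict[str, str], now: float) -> float:
    """Seconds a client should wait before its next request.
    0.0 means go immediately."""
    # A 429's Retry-After is authoritative: wait exactly that long.
    # (Assumes the delta-seconds form; Retry-After may also be an HTTP-date.)
    if "Retry-After" in headers:
        return float(headers["Retry-After"])

    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    reset_at = float(headers.get("X-RateLimit-Reset", now))
    if remaining <= 0:
        return max(0.0, reset_at - now)  # quota exhausted: wait for the reset
    # Proactive pacing: spread the remaining budget over the time left
    return max(0.0, (reset_at - now) / remaining)
```

With 30 requests left and 60 seconds until reset, the client paces itself to one request every 2 seconds instead of burning the budget and slamming into a 429.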

Common mistakes summary

  • In-process counters. Each Uvicorn worker has its own memory. Four workers means four independent counters — your effective limit is 4× what you intended. Always use Redis (or another external store) as the single source of truth.
  • Rate limiting before authentication. You're keying on IP address, which a bot rotates every few requests. Auth first, then rate limit by user ID — the identity that actually matters.
  • No Retry-After on 429. Clients retry immediately. You've turned a rate-limit event into a request storm. The header costs you nothing to add.
  • Fixed window only. Boundary attacks let adversaries get 2× quota in a 2-second span across a window reset. Use sliding windows for any security-sensitive endpoint.
  • Same limit on every endpoint. /api/login and /api/products have completely different risk profiles. Login needs 5–10 attempts/minute max (brute force); a product list endpoint can safely absorb much higher rates.
  • No X-RateLimit-Remaining. Well-behaved API clients (SDKs, mobile apps) use this to proactively slow down before hitting zero. Without it, they can't implement graceful backoff.
  • Not making the Redis pipeline atomic. The pipeline in the example above sends all four commands in one round-trip, and redis-py pipelines wrap them in MULTI/EXEC by default, so no other client's writes interleave. Issue the commands individually and another worker can write between your zremrangebyscore and zcard, skewing counts under concurrent load.
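On the per-endpoint point above: one lightweight pattern is a longest-prefix policy table the middleware consults before calling the limiter. The routes and numbers here are illustrative, not prescriptive:

```python
# Hypothetical per-endpoint policy: path prefix -> (limit, window_seconds).
# Security-sensitive routes get tight limits; cheap reads get generous ones.
ENDPOINT_LIMITS = {
    "/api/login":    (5, 60),    # brute-force protection
    "/api/search":   (30, 60),   # expensive queries
    "/api/products": (300, 60),  # cheap, cacheable reads
}
DEFAULT_LIMIT = (100, 60)

def limit_for_path(path: str) -> tuple[int, int]:
    """Longest-prefix match, so /api/login/reset inherits /api/login's limit."""
    best = ""
    for prefix in ENDPOINT_LIMITS:
        if path.startswith(prefix) and len(prefix) > len(best):
            best = prefix
    return ENDPOINT_LIMITS.get(best, DEFAULT_LIMIT)
```

The returned (limit, window) pair plugs straight into check_rate_limit — just include the matched prefix in the Redis key so each endpoint class gets its own bucket rather than sharing one counter.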

Part 4 done. Next up — Part 5: Caching Strategies & Redis Patterns. Cache-aside vs write-through vs write-behind, cache invalidation strategies, thundering herd, and when caching makes consistency problems worse.
