The Shared State Trap: How a FastAPI 'Optimisation' Leaked User Data
We replaced Flask's request-scoped g with a plain module-level dict during
our FastAPI migration. It worked in tests. It worked in staging. In production, under
concurrent load, it silently served one tenant's data to a completely different user.
For three days. The first signal was a support ticket with a screenshot.
The rewrite nobody questioned
Four years into running a Flask 1.x reporting API, we decided to rewrite it in FastAPI. The pitch was sound. Native async for slow I/O, Pydantic validation, OpenAPI docs that might actually stay in sync with reality. Management approved. Engineering was excited. Two sprints later tests were green, slow endpoints were 40% faster in load tests, and we deployed on a Wednesday.
The migration looked like a success. Dashboards green. Seventy-two hours later, the support tickets started coming in.
Wrong data, zero errors
"I'm seeing reports that don't belong to my company."
The first ticket we dismissed as a frontend cache glitch. The second made us nervous. The third had a screenshot and confirmed a multi-tenant data leak. Users were getting valid, well-formed API responses with HTTP 200 status codes and data that belonged to a different organisation.
The logs were completely clean. No exceptions. No 500s. No suspicious query patterns. No latency spikes. Just a steady stream of healthy 200 responses containing the wrong org's data. From a monitoring perspective, there was nothing to alert on.
I spent the next afternoon adding deep instrumentation. The org_id from
the JWT at auth time. The org_id passed to each database query. The
org_id on every returned row. Deployed and waited. When the next incident
hit, the log line read:
auth.org_id=2041 → query.org_id=2041 → result.org_id=1038
Auth was correct. Query filter was correct. The data that came back wasn't. Which
meant either the database was lying (unlikely) or the org_id my logging
statement was reading wasn't the same org_id my query was reading. I sat
with that for a minute before I understood what it meant.
The "Optimisation" that broke everything
In the old Flask codebase, we used flask.g extensively. Flask's
request-scoped proxy for storing per-request data through the life of a request. It's
how we avoided threading context (org ID, user ID, request metadata) through every
function signature in the codebase. Convenient, idiomatic Flask, and ran fine for
four years.
During the FastAPI migration, one of the team replaced flask.g with what
seemed like an equivalent: a module-level dictionary. Cleaner, they thought. No Flask
import, more "Pythonic." I reviewed the PR. I didn't flag it either.
# Looked harmless. Was catastrophic.
import asyncio

_request_context: dict = {}  # one dict, shared by every concurrent request

def set_context(org_id: int, user_id: int) -> None:
    _request_context["org_id"] = org_id
    _request_context["user_id"] = user_id

def get_org_id() -> int:
    return _request_context.get("org_id")

# Used in the route handler:
@router.get("/reports/{report_id}")
async def get_report(
    report_id: int,
    token: TokenData = Depends(verify_token),
):
    set_context(token.org_id, token.user_id)  # Set context for this "request"
    await asyncio.sleep(0)                    # Yield to event loop (batching)
    report = await fetch_report(report_id)    # Calls get_org_id() internally
    return report
In Flask, this pattern is safe. Flask uses Werkzeug's LocalProxy backed by
threading.local(). With thread-per-request, each thread has its own
isolated copy of any thread-local variable. Flask's g is inherently scoped
to one request, one thread.
FastAPI is different. It runs on an async event loop. One OS thread handles thousands
of concurrent requests. That module-level _request_context dict is one
object in memory, shared across every concurrent coroutine. When two requests write to
the same keys, the last write wins and whoever reads next gets the wrong value.
How the corruption happens
To understand why this fails, you need to see how Python's async event loop interleaves
coroutines. When a coroutine hits an await, it yields control back to the
event loop, which picks up another coroutine. Cooperative scheduling is why async is
fast. It's also why shared mutable state is a trap.
BROKEN: Module-level dict, two concurrent requests
Time │ Request A (org=2041) Request B (org=1038)
─────┼──────────────────────────────────────────────────────
t1 │ set_context(org_id=2041)
│ _request_context = {"org_id": 2041}
t2 │ await asyncio.sleep(0) ──────► yields to event loop
t3 │ set_context(org_id=1038)
│ _request_context = {"org_id": 1038}
t4 │ await db.fetch(...) ──► yields
t5 │ ◄────────────────────────────── event loop resumes A
t6 │ get_org_id()
t7 │ returns 1038 ✗ ← B overwrote A's key!
t8 │ query: WHERE org_id = 1038
t9 │ → org 1038's data returned to org 2041's user
_request_context = {"org_id": 1038}
─────────────────
One shared dict. All requests.
Any await is a potential interleave point. Our handler set the context,
then immediately awaited something (a cache lookup, a DB call, sometimes just
asyncio.sleep(0) for batching). In that window, another request could
write to the same dict. When the first request resumed, it read the wrong org ID,
queried with the wrong filter, and returned the wrong tenant's data.
Under low load, the timing almost never aligned. Under production load with dozens of concurrent requests, it happened constantly. Responses were structurally valid (correct JSON shape, HTTP 200, real data) so no automated monitor caught anything. There was nothing to catch. From the system's perspective, everything was working.
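The interleaving above can be reproduced in a few lines. This is a hypothetical minimal sketch, not our production code: a module-level dict shared by two asyncio tasks, with `asyncio.sleep(0)` forcing the same yield point our handler had.

```python
import asyncio

# Shared module-level state, mimicking the broken set_context/get_org_id pair.
_request_context: dict = {}

async def handle_request(org_id: int) -> int:
    _request_context["org_id"] = org_id  # "set context" for this request
    await asyncio.sleep(0)               # yield: another task can run here
    return _request_context["org_id"]    # may now hold another task's value

async def main() -> list[int]:
    # Two concurrent "requests" for different tenants.
    return await asyncio.gather(handle_request(2041), handle_request(1038))

results = asyncio.run(main())
print(results)  # [1038, 1038]: the last write won, request A reads B's org
```

Both tasks report org 1038, because task B's write landed between task A's write and task A's read. That is the whole bug.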
The fix: contextvars
Python 3.7 introduced contextvars, which is built for exactly this problem.
A ContextVar is automatically scoped to the current async task (or OS
thread). Each coroutine gets its own isolated binding. It's the async-native equivalent
of thread-local storage, and it works correctly across await boundaries.
from contextvars import ContextVar
from typing import Optional

# Each async task gets its own isolated copy of these values.
# ContextVar is safe across await boundaries — no shared state.
_org_id_var: ContextVar[Optional[int]] = ContextVar("org_id", default=None)
_user_id_var: ContextVar[Optional[int]] = ContextVar("user_id", default=None)

def set_context(org_id: int, user_id: int) -> None:
    _org_id_var.set(org_id)
    _user_id_var.set(user_id)

def get_org_id() -> int:
    org_id = _org_id_var.get()
    if org_id is None:
        raise RuntimeError("org_id not set — is set_context() missing from this path?")
    return org_id

def get_user_id() -> int:
    user_id = _user_id_var.get()
    if user_id is None:
        raise RuntimeError("user_id not set — is set_context() missing from this path?")
    return user_id
When Request A calls _org_id_var.set(2041), Python's async runtime stores
that binding in A's execution context, which is a lightweight namespace the event loop
maintains per coroutine. When Request B calls _org_id_var.set(1038), it
writes to B's context. The two never touch.
FIXED: ContextVar, two concurrent requests
Time │ Request A (org=2041) Request B (org=1038)
─────┼──────────────────────────────────────────────────────
t1 │ _org_id_var.set(2041)
│ Context A: { _org_id_var → 2041 }
t2 │ await asyncio.sleep(0) ──────► yields to event loop
t3 │ _org_id_var.set(1038)
│ Context B: { _org_id_var → 1038 }
t4 │ await db.fetch(...) ──► yields
t5 │ ◄────────────────────────────── event loop resumes A
t6 │ _org_id_var.get()
t7 │ returns 2041 ✓ ← reads from A's own context
t8 │ query: WHERE org_id = 2041
t9 │ → org 2041's data returned to org 2041's user ✓
Context A: { _org_id_var: 2041 } ← isolated
Context B: { _org_id_var: 1038 } ← isolated
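The isolation is easy to verify. A minimal sketch, mirroring the broken reproduction but with a `ContextVar`: each task created by `asyncio.gather` runs in its own copied context, so concurrent writes never collide.

```python
import asyncio
from contextvars import ContextVar

# Each asyncio task gets its own binding of this variable.
_org_id_var: ContextVar[int] = ContextVar("org_id")

async def handle_request(org_id: int) -> int:
    _org_id_var.set(org_id)  # bound in this task's own context
    await asyncio.sleep(0)   # yield: other tasks run and set their own value
    return _org_id_var.get() # still this task's value

async def main() -> list[int]:
    return await asyncio.gather(handle_request(2041), handle_request(1038))

results = asyncio.run(main())
print(results)  # [2041, 1038]: each task reads its own binding
```

Same yield points, same interleaving, correct answers: each task reads the value it set.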
One import swap. One class change. That was the entire fix. The damage it caused took considerably longer to clean up.
An honest post-mortem
We ran a full audit of every affected request. Three days of logs, cross-referenced against support tickets and org ID mismatches in our access logs. We identified seventeen tenants who had received at least one response containing another tenant's data. We disclosed to every one of them individually, revoked the affected report exports, and filed a GDPR incident report.
The disclosure calls were some of the most uncomfortable conversations I've had with clients. The data involved wasn't especially sensitive (aggregated analytics, not financial records or PII), but that didn't really soften anything. Data isolation is a contract.
What we changed after
Beyond the immediate fix, we made three structural changes to prevent a recurrence:
- We deprecated the context helpers entirely on new endpoints. org_id and user_id are now injected via FastAPI's Depends() system as typed parameters. Every function that needs the org ID receives it explicitly. The data flow is visible in every signature instead of hidden in a global.
- We added cross-tenant isolation tests. Integration tests fire two concurrent requests for different orgs and assert each response contains only data belonging to the requesting org. They run in CI on every PR, took about three hours to write, and would have caught this bug in staging immediately.
- We added a custom Pylint rule that flags any mutable module-level dict or list inside services/. Module-level state is fine for config and constants, not for per-request data. The linter makes the distinction enforced instead of advisory.
The broader lesson
The mistake wasn't carelessness. The developer who introduced it was experienced. The pattern of storing request context in a "global" is completely normal in Flask, Django, and every other thread-per-request framework. It's how you avoid prop-drilling context through twenty function signatures. For four years it had worked fine.
The problem was translating a thread-safe pattern into an async context without understanding what made it thread-safe in the first place.
Flask's g isn't just a dict. It's backed by LocalProxy, which wraps threading.local(). The safety is invisible unless you've read the source. When we copied the pattern without copying the mechanism, we got all of the convenience and none of the isolation.
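The mechanism that made the Flask version safe can be shown in isolation. A minimal sketch of threading.local(): two threads write to the "same" attribute, and a barrier guarantees both writes happen before either read, yet neither thread sees the other's value.

```python
import threading

# threading.local() gives each OS thread an isolated namespace, which is
# why thread-per-request "globals" like Flask's g are safe.
_local = threading.local()
results: dict[str, int] = {}

def handle_request(name: str, org_id: int, barrier: threading.Barrier) -> None:
    _local.org_id = org_id  # write lands in this thread's own copy
    barrier.wait()          # both threads have written before either reads
    results[name] = _local.org_id

barrier = threading.Barrier(2)
t1 = threading.Thread(target=handle_request, args=("a", 2041, barrier))
t2 = threading.Thread(target=handle_request, args=("b", 1038, barrier))
t1.start(); t2.start(); t1.join(); t2.join()
print(results)  # {'a': 2041, 'b': 1038}: no cross-thread overwrite
```

Swap the execution model from one-thread-per-request to one-thread-for-everything and this guarantee evaporates, which is exactly what the migration did.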
When you migrate from sync to async, every piece of "ambient" state deserves a hard look. Thread-local storage, request-local proxies, singleton caches. They all behave differently when the execution model changes. What was safe in thread-per-request can become a data leak in async.
If you're running FastAPI and passing context through your call chain via anything
other than explicit parameters or ContextVar, go audit it today. Not
tomorrow. I'm serious. Silent data leaks wait for the right concurrency timing, and
then they show up in a support ticket with a screenshot.