How a Redis Cache Key Missing One Field Leaked Client Data Across Tenants for 72 Hours
A Tuesday afternoon. A client emailed support saying their dashboard was showing the wrong company name. I assumed a display bug. Stale frontend cache, maybe a mismatched JOIN, something cosmetic. It took four hours to accept the real answer: for 72 hours one paying enterprise tenant had been reading another tenant's confidential project data, served by our own Redis cache, quietly, with no errors anywhere.
Production failure
The platform was a project management SaaS with multiple enterprise clients, each in their own
isolated workspace. Tenant isolation was enforced at the database layer: every query scoped by
tenant_id, every record owned by exactly one org. We'd audited this twice. The
database layer was clean.
What we hadn't audited was the caching layer.
Our Flask API cached expensive responses in Redis using keys like project:{project_id}.
Project IDs were auto-incrementing integers from Postgres. Tenant A's project #1041 and
Tenant B's project #1041 are two completely separate records. To Redis, they were
the same key.
Tenant A loaded their project first. Redis cached it as project:1041. Tenant B
loaded theirs 40 minutes later. Redis handed back Tenant A's data. Tenant B's UI rendered it
without complaining, because the shape matched and the fields matched. Only the content was
wrong, and the UI had no way of knowing that.
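The collision is easy to demonstrate without a real Redis. A minimal sketch, with a plain dict standing in for the cache and for the tenant-scoped database (all names here are illustrative, not the production code):

```python
cache = {}

def get_project_tenant_blind(db_rows, tenant_id, project_id):
    """Simulates the buggy handler: the 'DB query' is tenant-scoped,
    but the cache key is not."""
    cache_key = f"project:{project_id}"          # no tenant in the key
    if cache_key in cache:
        return cache[cache_key]                  # whoever cached first wins
    row = db_rows[(tenant_id, project_id)]       # correctly scoped lookup
    cache[cache_key] = row
    return row

# Two tenants whose auto-increment project IDs collide
db_rows = {
    ("tenant_a", 1041): {"name": "A's roadmap", "tenant": "tenant_a"},
    ("tenant_b", 1041): {"name": "B's launch plan", "tenant": "tenant_b"},
}

first = get_project_tenant_blind(db_rows, "tenant_a", 1041)   # MISS, caches A's row
second = get_project_tenant_blind(db_rows, "tenant_b", 1041)  # HIT, returns A's row
print(second["tenant"])  # → tenant_a, served to tenant_b
```

Nothing errors, nothing logs; the second call simply never reaches the "database".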
False assumptions
First instinct was frontend. The React app maintained local state, and the obvious suspect was a stale context or a component that didn't re-fetch on navigation. We spent 90 minutes in browser devtools before confirming the API itself was returning the wrong data.
Second instinct: the database query. We grepped every query file for missing
tenant_id filters. Found nothing. Ran the suspect query manually with both tenant
IDs. Each returned exactly the right rows. The database was correct.
Which, honestly, felt worse than finding a bug.
"The database is clean. The API returns the wrong data. If the query is right and the result is wrong, something between the query and the response is substituting the answer."
That sentence is what finally pointed us at the cache.
Reproducing the poisoning
Reproducing it took 8 minutes once we had the hypothesis. Two test tenant accounts, each with a fresh project so they'd end up sharing an auto-increment ID. Load Tenant A's project endpoint. Redis caches it. Switch auth header to Tenant B, hit the same endpoint. Redis returns Tenant A's payload. Confirmed.
POISONED REQUEST FLOW
─────────────────────────────────────────────────────────────
Tenant A — GET /api/projects/1041

┌──────────┐     ┌──────────────────┐     ┌─────────────────┐
│ Client A │────▶│ Flask API        │────▶│ Redis           │
└──────────┘     │                  │     │ MISS            │
                 │ cache_key =      │     │                 │
                 │ "project:1041"   │◀────│ SET project:    │
                 │                  │     │ 1041 = {A data} │
                 └──────────────────┘     └─────────────────┘
                          │
                          ▼
          DB query WHERE id=1041  ← correct, returns A's row
          Cache SET project:1041  ← keyed without tenant

40 minutes later — Tenant B — GET /api/projects/1041

┌──────────┐     ┌──────────────────┐     ┌─────────────────┐
│ Client B │────▶│ Flask API        │────▶│ Redis           │
└──────────┘     │                  │     │ HIT ✓           │
                 │ cache_key =      │◀────│ "project:1041"  │
                 │ "project:1041"   │     │ = {A data} ⚠️   │
                 └──────────────────┘     └─────────────────┘
                          │
                          ▼
          Returns Tenant A's data to Tenant B  ← never touches DB
CORRECT FLOW (after fix)
─────────────────────────────────────────────────────────────
Tenant B — GET /api/projects/1041

┌──────────┐     ┌──────────────────────────────────────────┐
│ Client B │────▶│ cache_key = "project:{tenant_id}:1041"   │
└──────────┘     │           = "project:tenant_b_uuid:1041" │
                 │                                          │
                 │ Redis MISS → DB query → cache SET        │
                 │ Returns Tenant B's data ✓                │
                 └──────────────────────────────────────────┘
Root cause: cache keys built without tenant scope
The caching helper was written before multi-tenancy was added to the platform. When the tenant layer was bolted on later, the database queries were updated correctly. The cache key builder was never touched. Nobody remembered it existed.
# BEFORE — tenant-blind cache key
def get_project(project_id: int):
    cache_key = f"project:{project_id}"
    cached = redis.get(cache_key)
    if cached:
        return json.loads(cached)
    row = db.execute(
        "SELECT * FROM projects WHERE id = %s AND tenant_id = %s",
        (project_id, g.tenant_id)  # DB query is scoped correctly
    ).fetchone()
    redis.setex(cache_key, 300, json.dumps(row))  # cache key is NOT
    return row
# AFTER — tenant-scoped cache key
def get_project(project_id: int):
    # Include tenant_id in the key — different tenants never share a cache entry
    cache_key = f"project:{g.tenant_id}:{project_id}"
    cached = redis.get(cache_key)
    if cached:
        return json.loads(cached)
    row = db.execute(
        "SELECT * FROM projects WHERE id = %s AND tenant_id = %s",
        (project_id, g.tenant_id)
    ).fetchone()
    if row is None:
        return None  # also: don't cache a None — that's a separate bug
    redis.setex(cache_key, 300, json.dumps(row))
    return row
# ALSO ADDED — cache key audit helper (run in CI)
CACHE_KEY_PATTERNS = {
    "project": "project:{tenant_id}:{project_id}",
    "member": "member:{tenant_id}:{member_id}",
    "report": "report:{tenant_id}:{report_id}:{date_range}",
}
# Any cache SET that doesn't match a known pattern raises in staging
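The audit itself can be a thin validator that turns each registered pattern into a regex and rejects unknown key shapes at write time. A sketch under assumptions (colon-delimited keys, a `strict` flag flipped on in staging/CI; none of this is the production helper verbatim):

```python
import re

CACHE_KEY_PATTERNS = {
    "project": "project:{tenant_id}:{project_id}",
    "member": "member:{tenant_id}:{member_id}",
    "report": "report:{tenant_id}:{report_id}:{date_range}",
}

# Compile each registered pattern into a regex: every "{placeholder}"
# matches one non-colon segment, everything else must match literally.
_COMPILED = [
    re.compile("^" + re.sub(r"\{[^{}]+\}", "[^:]+", pattern) + "$")
    for pattern in CACHE_KEY_PATTERNS.values()
]

def validate_cache_key(key: str, strict: bool = True) -> bool:
    """Return True if the key matches a registered pattern.
    In strict mode (staging/CI), an unregistered shape raises instead."""
    if any(rx.match(key) for rx in _COMPILED):
        return True
    if strict:
        raise ValueError(
            f"Cache key {key!r} matches no registered pattern; "
            "missing tenant scope? Register it in CACHE_KEY_PATTERNS."
        )
    return False
```

Wrapping `cache.set` so every write passes through `validate_cache_key` is what turns the tenant-blind key `project:1041` into a loud failure instead of a silent leak.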
Architecture fix: tenant-scoped keys + cache audit layer
The immediate fix was prefixing every cache key with tenant_id. We picked the
tenant UUID over the integer PK specifically to prevent enumeration; an attacker who can
influence a cache key should not be able to guess another tenant's key by incrementing an int.
We also considered Postgres row-level security as the "real" fix, so it would be structurally
impossible for a query to return cross-tenant data even if the WHERE tenant_id
clause went missing. We want to get there. It's a schema migration and careful testing across
40+ query sites, though, and the cache key fix was safe and deployable inside an hour. RLS
is on the roadmap. It has been on the roadmap for a while now.
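For reference, the shape of the RLS policy we'd migrate to looks roughly like this; table, column, and setting names here are illustrative, not our actual schema:

```sql
-- Enable row-level security on the table
ALTER TABLE projects ENABLE ROW LEVEL SECURITY;

-- Every query sees only the current tenant's rows, even if an
-- application query forgets its WHERE tenant_id clause entirely.
CREATE POLICY tenant_isolation ON projects
    USING (tenant_id = current_setting('app.current_tenant')::uuid);

-- The API binds the tenant once per request, inside the transaction:
--   SET LOCAL app.current_tenant = '<tenant uuid>';
```

Note that RLS would not have caught this incident on its own: the poisoned reads never reached Postgres. It closes the query-layer hole; the cache key fix closes the cache-layer one.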
The second layer was a cache key registry. Every valid cache key shape lives in a central manifest. Any cache write in staging that uses an unregistered pattern raises. That turns "someone wrote a cache key without tenant scope" from a silent production bug into a CI failure, which is where this class of problem belongs.
CACHE KEY AUDIT IN CI PIPELINE
─────────────────────────────────────────────────────────────
Developer writes new cached endpoint
              │
              ▼
cache.set("new_resource:{id}", data)
              │
              ▼
CI: run cache_key_audit.py
              │
              ├── Key matches registered pattern? ──▶ ✅ PASS
              │
              └── Key NOT in registry? ──────────────▶ ❌ FAIL
                    "Cache key 'new_resource:{id}'
                     missing tenant scope.
                     Register in CACHE_KEY_PATTERNS
                     or add tenant_id prefix."
Lessons learned
- Tenant isolation is not just a database concern. Every layer that persists or caches data (Redis, CDN, in-memory stores, even log aggregators) needs to be audited for tenant scope when you add multi-tenancy.
- Auto-increment IDs across tenants will eventually collide. If you use integer PKs, two tenants will at some point have the same resource ID. Cache keys have to carry the tenant identifier.
- Silent correctness failures are worse than crashes. This ran for 72 hours with zero error rates, zero latency spikes, zero alerts. The only signal was a client email. Data-correctness checks catch this class of bug; availability monitoring doesn't.
- When multi-tenancy gets layered onto a single-tenant codebase, assume every component that caches, queues, or stores data has the same blindspot. We went hunting through the queue worker next, and found two more.
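The third lesson suggests a cheap correctness monitor we could have had: sample live cache entries and assert that the tenant segment embedded in each key matches the tenant_id inside the payload. A hypothetical sketch (function and field names are assumptions, shown against a dict rather than a Redis client):

```python
import json

def audit_cached_tenancy(cache, sample_keys):
    """For each sampled key shaped 'resource:{tenant}:{id}', flag entries
    whose payload's tenant_id disagrees with the key's tenant segment."""
    mismatches = []
    for key in sample_keys:
        raw = cache.get(key)
        if raw is None:
            continue  # expired between sampling and read
        parts = key.split(":")
        if len(parts) < 3:
            mismatches.append((key, "key missing tenant segment"))
            continue
        payload = json.loads(raw)
        if payload.get("tenant_id") != parts[1]:
            mismatches.append((key, f"payload tenant {payload.get('tenant_id')!r}"))
    return mismatches

# One correct entry, one poisoned one
cache = {
    "project:tenant_a:1041": json.dumps({"tenant_id": "tenant_a"}),
    "project:tenant_b:2007": json.dumps({"tenant_id": "tenant_a"}),  # poisoned
}
bad = audit_cached_tenancy(cache, list(cache))
# bad contains only the poisoned key
```

Run on a small random sample every few minutes, a check like this would have turned a 72-hour silent leak into an alert within one cache TTL.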
The client whose data was exposed took the call well, which I did not expect. They wanted a written RCA within 48 hours. I spent a weekend on it.