How a Redis Cache Key Missing One Field Leaked Client Data Across Tenants for 72 Hours
A Tuesday afternoon. A client emailed support saying their dashboard was showing the wrong company name. I assumed a display bug. Stale frontend cache, maybe a mismatched JOIN, something cosmetic. It took four hours to accept the real answer: for 72 hours one paying enterprise tenant had been reading another tenant's confidential project data, served by our own Redis cache, quietly, with no errors anywhere.
Production failure
The platform was a project management SaaS with multiple enterprise clients, each in their own
isolated workspace. Tenant isolation was enforced at the database layer: every query scoped by
tenant_id, every record owned by exactly one org. We'd audited this twice. The
database layer was clean.
What we hadn't audited was the caching layer.
Our Flask API cached expensive responses in Redis using keys like project:{project_id}.
Project IDs were auto-incrementing integers from Postgres. Tenant A's project #1041 and
Tenant B's project #1041 are two completely separate records. To Redis, they were
the same key.
Tenant A loaded their project first. Redis cached it as project:1041. Tenant B
loaded theirs 40 minutes later. Redis handed back Tenant A's data. Tenant B's UI rendered it
without complaining, because the shape matched and the fields matched. Only the content was
wrong, and the UI had no way of knowing that.
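The collision is easy to demonstrate without a real Redis. A minimal sketch, with a plain dict standing in for the cache and for the tenant-scoped database (all names here are illustrative, not the production code):

```python
cache = {}

def get_project_tenant_blind(db_rows, tenant_id, project_id):
    """Simulates the buggy handler: the 'DB query' is tenant-scoped,
    but the cache key is not."""
    cache_key = f"project:{project_id}"          # no tenant in the key
    if cache_key in cache:
        return cache[cache_key]                  # whoever cached first wins
    row = db_rows[(tenant_id, project_id)]       # correctly scoped lookup
    cache[cache_key] = row
    return row

# Two tenants whose auto-increment project IDs collide
db_rows = {
    ("tenant_a", 1041): {"name": "A's roadmap", "tenant": "tenant_a"},
    ("tenant_b", 1041): {"name": "B's launch plan", "tenant": "tenant_b"},
}

first = get_project_tenant_blind(db_rows, "tenant_a", 1041)   # MISS, caches A's row
second = get_project_tenant_blind(db_rows, "tenant_b", 1041)  # HIT, returns A's row
print(second["tenant"])  # → tenant_a, served to tenant_b
```

Nothing errors, nothing logs; the second call simply never reaches the "database".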
False assumptions
First instinct was frontend. The React app maintained local state, and the obvious suspect was a stale context or a component that didn't re-fetch on navigation. We spent 90 minutes in browser devtools before confirming the API itself was returning the wrong data.
Second instinct: the database query. We grepped every query file for missing
tenant_id filters. Found nothing. Ran the suspect query manually with both tenant
IDs. Each returned exactly the right rows. The database was correct.
Which, honestly, felt worse than finding a bug.
"The database is clean. The API returns the wrong data. If the query is right and the result is wrong, something between the query and the response is substituting the answer."
That sentence is what finally pointed us at the cache.
Reproducing the poisoning
Reproducing it took 8 minutes once we had the hypothesis. Two test tenant accounts, each with a fresh project so they'd end up sharing an auto-increment ID. Load Tenant A's project endpoint. Redis caches it. Switch auth header to Tenant B, hit the same endpoint. Redis returns Tenant A's payload. Confirmed.
POISONED REQUEST FLOW
─────────────────────────────────────────────────────────────
Tenant A — GET /api/projects/1041

┌──────────┐     ┌──────────────────┐     ┌─────────────────┐
│ Client A │────▶│ Flask API        │────▶│ Redis           │
└──────────┘     │                  │     │ MISS            │
                 │ cache_key =      │     │                 │
                 │ "project:1041"   │◀────│ SET project:    │
                 │                  │     │ 1041 = {A data} │
                 └──────────────────┘     └─────────────────┘
                          │
                          ▼
          DB query WHERE id=1041  ← correct, returns A's row
          Cache SET project:1041  ← keyed without tenant

40 minutes later — Tenant B — GET /api/projects/1041

┌──────────┐     ┌──────────────────┐     ┌─────────────────┐
│ Client B │────▶│ Flask API        │────▶│ Redis           │
└──────────┘     │                  │     │ HIT ✓           │
                 │ cache_key =      │◀────│ "project:1041"  │
                 │ "project:1041"   │     │ = {A data} ⚠️   │
                 └──────────────────┘     └─────────────────┘
                          │
                          ▼
          Returns Tenant A's data to Tenant B  ← never touches DB
CORRECT FLOW (after fix)
─────────────────────────────────────────────────────────────
Tenant B — GET /api/projects/1041

┌──────────┐     ┌──────────────────────────────────────────┐
│ Client B │────▶│ cache_key = "project:{tenant_id}:1041"   │
└──────────┘     │           = "project:tenant_b_uuid:1041" │
                 │                                          │
                 │ Redis MISS → DB query → cache SET        │
                 │ Returns Tenant B's data ✓                │
                 └──────────────────────────────────────────┘
Root cause: cache keys built without tenant scope
The caching helper was written before multi-tenancy was added to the platform. When the tenant layer was bolted on later, the database queries were updated correctly. The cache key builder was never touched. Nobody remembered it existed.
# BEFORE — tenant-blind cache key
def get_project(project_id: int):
    cache_key = f"project:{project_id}"
    cached = redis.get(cache_key)
    if cached:
        return json.loads(cached)
    row = db.execute(
        "SELECT * FROM projects WHERE id = %s AND tenant_id = %s",
        (project_id, g.tenant_id)  # DB query is scoped correctly
    ).fetchone()
    redis.setex(cache_key, 300, json.dumps(row))  # cache key is NOT
    return row
# AFTER — tenant-scoped cache key
def get_project(project_id: int):
    # Include tenant_id in the key — different tenants never share a cache entry
    cache_key = f"project:{g.tenant_id}:{project_id}"
    cached = redis.get(cache_key)
    if cached:
        return json.loads(cached)
    row = db.execute(
        "SELECT * FROM projects WHERE id = %s AND tenant_id = %s",
        (project_id, g.tenant_id)
    ).fetchone()
    if row is None:
        return None  # also: don't cache a None — that's a separate bug
    redis.setex(cache_key, 300, json.dumps(row))
    return row
# ALSO ADDED — cache key audit helper (run in CI)
CACHE_KEY_PATTERNS = {
    "project": "project:{tenant_id}:{project_id}",
    "member": "member:{tenant_id}:{member_id}",
    "report": "report:{tenant_id}:{report_id}:{date_range}",
}
# Any cache SET that doesn't match a known pattern raises in staging
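The audit itself can be a thin validator that turns each registered pattern into a regex and rejects unknown key shapes at write time. A sketch under assumptions (colon-delimited keys, a `strict` flag flipped on in staging/CI; none of this is the production helper verbatim):

```python
import re

CACHE_KEY_PATTERNS = {
    "project": "project:{tenant_id}:{project_id}",
    "member": "member:{tenant_id}:{member_id}",
    "report": "report:{tenant_id}:{report_id}:{date_range}",
}

# Compile each registered pattern into a regex: every "{placeholder}"
# matches one non-colon segment, everything else must match literally.
_COMPILED = [
    re.compile("^" + re.sub(r"\{[^{}]+\}", "[^:]+", pattern) + "$")
    for pattern in CACHE_KEY_PATTERNS.values()
]

def validate_cache_key(key: str, strict: bool = True) -> bool:
    """Return True if the key matches a registered pattern.
    In strict mode (staging/CI), an unregistered shape raises instead."""
    if any(rx.match(key) for rx in _COMPILED):
        return True
    if strict:
        raise ValueError(
            f"Cache key {key!r} matches no registered pattern; "
            "missing tenant scope? Register it in CACHE_KEY_PATTERNS."
        )
    return False
```

Wrapping `cache.set` so every write passes through `validate_cache_key` is what turns the tenant-blind key `project:1041` into a loud failure instead of a silent leak.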
Architecture fix: tenant-scoped keys + cache audit layer
The immediate fix was prefixing every cache key with tenant_id. We picked the
tenant UUID over the integer PK specifically to prevent enumeration; an attacker who can
influence a cache key should not be able to guess another tenant's key by incrementing an int.
We also considered Postgres row-level security as the "real" fix, so it would be structurally
impossible for a query to return cross-tenant data even if the WHERE tenant_id
clause went missing. We want to get there. It's a schema migration and careful testing across
40+ query sites, though, and the cache key fix was safe and deployable inside an hour. RLS
is on the roadmap. It has been on the roadmap for a while now.
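For reference, the shape of the RLS policy we'd migrate to looks roughly like this; table, column, and setting names here are illustrative, not our actual schema:

```sql
-- Enable row-level security on the table
ALTER TABLE projects ENABLE ROW LEVEL SECURITY;

-- Every query sees only the current tenant's rows, even if an
-- application query forgets its WHERE tenant_id clause entirely.
CREATE POLICY tenant_isolation ON projects
    USING (tenant_id = current_setting('app.current_tenant')::uuid);

-- The API binds the tenant once per request, inside the transaction:
--   SET LOCAL app.current_tenant = '<tenant uuid>';
```

Note that RLS would not have caught this incident on its own: the poisoned reads never reached Postgres. It closes the query-layer hole; the cache key fix closes the cache-layer one.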
The second layer was a cache key registry. Every valid cache key shape lives in a central manifest. Any cache write in staging that uses an unregistered pattern raises. That turns "someone wrote a cache key without tenant scope" from a silent production bug into a CI failure, which is where this class of problem belongs.
CACHE KEY AUDIT IN CI PIPELINE
─────────────────────────────────────────────────────────────
Developer writes new cached endpoint
              │
              ▼
cache.set("new_resource:{id}", data)
              │
              ▼
CI: run cache_key_audit.py
              │
              ├── Key matches registered pattern? ──▶ ✅ PASS
              │
              └── Key NOT in registry? ──────────────▶ ❌ FAIL
                    "Cache key 'new_resource:{id}'
                     missing tenant scope.
                     Register in CACHE_KEY_PATTERNS
                     or add tenant_id prefix."
Lessons learned
- Tenant isolation is not just a database concern. Every layer that persists or caches data (Redis, CDN, in-memory stores, even log aggregators) needs to be audited for tenant scope when you add multi-tenancy.
- Auto-increment IDs across tenants will eventually collide. If you use integer PKs, two tenants will at some point have the same resource ID. Cache keys have to carry the tenant identifier.
- Silent correctness failures are worse than crashes. This ran for 72 hours with zero error rates, zero latency spikes, zero alerts. The only signal was a client email. Data-correctness checks catch this class of bug; availability monitoring doesn't.
- When multi-tenancy gets layered onto a single-tenant codebase, assume every component that caches, queues, or stores data has the same blindspot. We went hunting through the queue worker next, and found two more.
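The third lesson suggests a cheap correctness monitor we could have had: sample live cache entries and assert that the tenant segment embedded in each key matches the tenant_id inside the payload. A hypothetical sketch (function and field names are assumptions, shown against a dict rather than a Redis client):

```python
import json

def audit_cached_tenancy(cache, sample_keys):
    """For each sampled key shaped 'resource:{tenant}:{id}', flag entries
    whose payload's tenant_id disagrees with the key's tenant segment."""
    mismatches = []
    for key in sample_keys:
        raw = cache.get(key)
        if raw is None:
            continue  # expired between sampling and read
        parts = key.split(":")
        if len(parts) < 3:
            mismatches.append((key, "key missing tenant segment"))
            continue
        payload = json.loads(raw)
        if payload.get("tenant_id") != parts[1]:
            mismatches.append((key, f"payload tenant {payload.get('tenant_id')!r}"))
    return mismatches

# One correct entry, one poisoned one
cache = {
    "project:tenant_a:1041": json.dumps({"tenant_id": "tenant_a"}),
    "project:tenant_b:2007": json.dumps({"tenant_id": "tenant_a"}),  # poisoned
}
bad = audit_cached_tenancy(cache, list(cache))
# bad contains only the poisoned key
```

Run on a small random sample every few minutes, a check like this would have turned a 72-hour silent leak into an alert within one cache TTL.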
The client whose data was exposed took the call well, which I did not expect. They wanted a written RCA within 48 hours. I spent a weekend on it.