How SQLAlchemy's Identity Map Served Stale Data to 23,000 API Requests
We wanted explicit control over our SQLAlchemy sessions. No magic, no hidden Flask-SQLAlchemy
globals, just dependency injection that we owned. We set up scoped_session,
thought we'd written the teardown handler, deployed, and went home. Six hours later our HR
platform was quietly returning outdated employee records after every profile update. No
exceptions. No slow queries. Nothing in the logs. 23,000 confidently wrong API responses.
Production failure: updates that never stuck
The tickets came in clusters. An employee would update their email address, get an HTTP 200 with the new value in the response body, then open their profile thirty seconds later to find the old address still showing. Support initially blamed the frontend. A cache issue, probably. Clear your browser, try again. But the reports kept coming, from different browsers, different users, different offices.
I pulled the database rows manually. The updates were there. PostgreSQL had the correct values. Whatever was returning stale data wasn't the database. Something in our API layer had decided the database was optional.
False assumptions: "Sessions are stateless per-request"
Our mental model was simple and wrong. We'd read that scoped_session isolates
sessions by thread identity. We were running Gunicorn with four threaded workers
(--threads 4). I assumed that meant four isolated sessions, one per active
thread, each living exactly as long as its request. Clean. Correct. Thread-safe.
What I missed: scoped_session creates a registry keyed by thread
identity, but it does not create a new session for each request on that thread. It creates
one session per thread and hands back the same session every time you call
Session() from that thread. The session persists until you explicitly call
Session.remove(). We never did.
SQLAlchemy's scoped_session guarantees one session per thread.
It says nothing about one session per request. That contract is yours to enforce.
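That contract takes only a few lines to see. A minimal sketch, with an in-memory SQLite engine standing in for the real database:

```python
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker, scoped_session

engine = create_engine("sqlite://")  # stand-in for the production engine
Session = scoped_session(sessionmaker(bind=engine))

s1 = Session()
s2 = Session()
assert s1 is s2      # same thread -> same session, on every call

Session.remove()     # discard this thread's session from the registry

s3 = Session()
assert s3 is not s1  # only now does the thread get a fresh session
```

Nothing about request boundaries appears anywhere in that code, which is exactly the point: if you want one session per request, you have to call `Session.remove()` yourself.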
Compounding this: SQLAlchemy sessions maintain an identity map, a first-level cache that maps each loaded object's primary key to the in-memory Python object. When you query for an entity that's already in the identity map, SQLAlchemy skips the database entirely and returns the cached object. This is a performance feature designed for within-request deduplication. We were accidentally using it as a cross-request stale-data store.
Reproducing the bottleneck
The bug only showed up under load. On the local dev server with Flask's single-threaded
built-in server we rarely saw symptoms: the debug reloader restarted the process whenever
code changed, throwing away the thread's session with it. In staging we ran a single
Gunicorn worker with no threads, so the same session was reused, but the timing rarely
produced symptoms we noticed. Production had four
concurrent threads per worker. One thread would handle the PUT /employees/42
update, then immediately handle a GET /employees/42 read from a different
client, and serve that client the version of Employee #42 loaded twenty requests ago.
I wrote a forty-line script that sent two requests on the same thread in sequence. A
write changing the email, then a read expecting the new email. Without
Session.remove(), the read returned the old email 100% of the time. With it,
the read always returned the updated value. Deterministic repro. At that point I stopped
feeling clever about skipping Flask-SQLAlchemy.
```python
# database/session.py
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker, scoped_session

engine = create_engine(
    "postgresql+psycopg2://user:pass@localhost/hrdb",
    pool_size=10,
    max_overflow=5,
)

# Creates a thread-local session registry. Correct so far.
Session = scoped_session(sessionmaker(bind=engine))

def get_session():
    return Session()  # Returns the same session for this thread every time
```

```python
# app.py — the missing teardown
from flask import Flask
from database.session import Session

app = Flask(__name__)

# ✗ Missing: @app.teardown_appcontext
# ✗ Missing: def shutdown_session(exception): Session.remove()
#
# Without this, the scoped_session's registry holds the session open.
# Every request on the same thread reuses the same session — and its
# identity map — indefinitely. The identity map grows. Old objects
# stay cached. Reads never hit the database for entities already loaded.
```
Root cause: the identity map doesn't know about requests
SQLAlchemy's identity map works well inside a single unit of work. Load
Employee(id=42) twice within one request and SQLAlchemy returns the same
Python object both times. No redundant round-trip, mutations automatically tracked on the
canonical object. This is exactly how an ORM should work.
The catch: the identity map has no concept of an HTTP request boundary. It lives on the
session. When the session outlives the request (which ours did for every request after
the first on each thread), the identity map carries stale state forward. An
Employee loaded at 14:00:01 stays in the map. When the same thread handles
GET /employees/42 at 14:00:34, SQLAlchemy checks the map, finds the object,
and returns it without emitting a SELECT.
BROKEN: scoped_session without Session.remove()
Thread 1, Gunicorn Worker A
─────────────────────────────────────────────────────────────────
Request 1 GET /employees/42
│ Session.get(Employee, 42)
│ → No identity map entry → emits SELECT → loads e42
│ Identity map: { Employee#42: {email: "old@corp.com"} }
│ Returns: old@corp.com ✓
│
─ request ends — NO Session.remove() — session survives ─────────
│
Request 2 PUT /employees/42 body: {email: "new@corp.com"}
│ Direct UPDATE via different DB connection (fine)
│ Commit succeeds. DB row now has new@corp.com.
│
─ request ends — NO Session.remove() — session survives ─────────
│
Request 3 GET /employees/42 (different user's client)
│ Session.get(Employee, 42)
│ → Identity map HIT → no SELECT emitted
│ → Returns cached: {email: "old@corp.com"} ✗
│
▼ 23,000 requests served stale data this way
Identity map grows unbounded. Database is bypassed.
No exception. No log line. No alert.
The PUT was issued over a separate database connection (our write path
opened its own connection for explicit transaction control). That connection committed
successfully. The database had the correct data. But the read session's identity map
was never invalidated. It had no way to know another connection had changed the row. It
returned what it remembered.
Architecture fix: enforce the request boundary
The fix is one function and one decorator: the teardown hook that Flask-SQLAlchemy registers automatically, which we forgot to add in our "explicit" setup.
```python
from flask import Flask
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker, scoped_session

engine = create_engine(
    "postgresql+psycopg2://user:pass@localhost/hrdb",
    pool_size=10,
    max_overflow=5,
)

# Expire all attributes on commit — fetch fresh values on next access.
# This is the default, but making it explicit prevents future regressions.
Session = scoped_session(
    sessionmaker(bind=engine, expire_on_commit=True)
)

app = Flask(__name__)

@app.teardown_appcontext
def shutdown_session(exception=None):
    """
    Called automatically at the end of every request context.

    Session.remove() does two things:
    1. Calls session.close() — returns the connection to the pool.
    2. Removes the session from the scoped registry — next request
       on this thread gets a brand-new session with an empty identity map.
    """
    Session.remove()

def get_session():
    """
    Within a request, always returns the same session (correct).
    After teardown, the next call returns a new session (correct).
    """
    return Session()
```
With this in place, the session lifecycle matches the request lifecycle. Every request starts with a clean identity map. Reads always hit the database. The performance cost is negligible: PostgreSQL query latency for a primary-key lookup averages 1.2ms on our cluster, well within SLA budget.
FIXED: scoped_session + Session.remove() in teardown
Thread 1, Gunicorn Worker A
─────────────────────────────────────────────────────────────────
Request 1 GET /employees/42
│ Session.get(Employee, 42)
│ → Empty identity map → emits SELECT → loads e42
│ Identity map: { Employee#42: {email: "old@corp.com"} }
│ Returns: old@corp.com ✓
│
─ request ends → @teardown_appcontext fires → Session.remove() ──
│ Identity map cleared. Session returned to pool.
│
Request 2 PUT /employees/42 body: {email: "new@corp.com"}
│ Fresh session. UPDATE + COMMIT. DB: new@corp.com ✓
│
─ request ends → @teardown_appcontext fires → Session.remove() ──
│ Identity map cleared. Session returned to pool.
│
Request 3 GET /employees/42 (different user's client)
│ Session.get(Employee, 42)
│ → Empty identity map → emits SELECT
│ → DB returns: {email: "new@corp.com"} ✓
│
▼ Every request: clean session, fresh data, correct result
Avg latency: 1.2ms per primary-key SELECT. Acceptable cost.
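The teardown behaviour itself is easy to confirm in isolation: Flask's test_request_context pushes and pops a real request context, so teardown_appcontext fires when the with block exits. A minimal sketch with a stand-in engine:

```python
from flask import Flask
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker, scoped_session

engine = create_engine("sqlite://")  # stand-in engine; no queries needed
Session = scoped_session(sessionmaker(bind=engine, expire_on_commit=True))

app = Flask(__name__)

@app.teardown_appcontext
def shutdown_session(exception=None):
    Session.remove()

# test_request_context pushes and pops a request context exactly like a
# real request does, so the teardown hook fires when each block exits.
with app.test_request_context("/"):
    first = Session()

with app.test_request_context("/"):
    second = Session()

assert first is not second  # each "request" got its own session
```

This is also the shape of the CI lifecycle test described below: two simulated request contexts, one assertion that the sessions are distinct.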
We added two more layers of defence. First, expire_on_commit=True is set
explicitly in the session factory. This is the default, but naming it makes the intent
visible to future reviewers: when an object is committed, all its attributes expire, so
the next access reloads from the database even if the session happens to persist. Second,
a Prometheus gauge tracks active sessions per worker and alerts if a session outlives
a single request cycle.
What changed after the incident
- Session lifecycle test in CI. An integration test performs a write via one simulated request context, explicitly removes the session, then performs a read in a new context and asserts the read returns the updated value. Runs on every PR. Would have caught this bug before it reached staging.
- Mandatory teardown linting. A custom Flake8 rule errors if any Flask module imports scoped_session without a Session.remove() call inside a @app.teardown_appcontext-decorated function. Crude, effective. Fails the build if the safety net is missing.
- Session-per-request for new services. No global scoped registry, just a with get_db_session() as session: block in each route handler. Explicit scope, explicit cleanup, no global state. The teardown approach works, but a context manager makes the lifecycle impossible to miss in review.
- Query emission logging in staging. SQLAlchemy's echo=True is on, and a test fires a write followed by a read and asserts a SELECT is emitted for the read. No SELECT, no pass.
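The session-per-request pattern from the third item, sketched as a context manager. get_db_session is the name from the bullet above; the engine URL and everything else here are placeholders:

```python
from contextlib import contextmanager
from sqlalchemy import create_engine, text
from sqlalchemy.orm import sessionmaker

engine = create_engine("sqlite://")  # placeholder; the real URL lives in config
SessionFactory = sessionmaker(bind=engine)

@contextmanager
def get_db_session():
    """One session per request: created here, closed here, no global registry."""
    session = SessionFactory()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()  # identity map discarded every time, impossible to leak

# A route handler uses it as:
with get_db_session() as session:
    session.execute(text("SELECT 1"))  # any ORM work; cleanup is automatic
```

The lifecycle is visible at the call site instead of hidden in an app-level hook, which is why new services get this instead of the scoped registry.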
The broader pattern: ORM caches you didn't know you had
Every mature ORM ships with at least one caching layer, and almost none of them announce
themselves loudly. SQLAlchemy has the identity map (first-level cache) and optionally a
second-level cache. Django's ORM evaluates querysets lazily and re-executes them if you
iterate twice, but select_related and prefetch_related objects
are cached on the queryset. Hibernate has first and second-level caches. Each one is a
performance optimisation that becomes a correctness hazard the moment scope boundaries
slip.
The identity map is designed for within-transaction consistency, not cross-request consistency. When your session outlives your transaction, you've silently changed its contract.
The failure mode is always the same: fast, correct-looking, wrong. No exceptions, because nothing errored from the ORM's perspective. The session did exactly what it was designed to do and returned a cached object efficiently. The bug was in the frame. We were treating a transaction-scoped cache as if it were request-scoped.
If you're running Flask with SQLAlchemy outside Flask-SQLAlchemy, grep your codebase for
scoped_session right now. Every Flask app that imports it needs a
@app.teardown_appcontext handler calling Session.remove(). If
yours doesn't, you have a latent stale-read bug that only fires under concurrent load. Which
means it's already firing in production before staging has a chance to catch it.
After we deployed the fix, read latency for employee endpoints went up by an average of 1.4ms due to the additional SELECT being emitted. It was the best regression I've ever shipped.