How a GitHub Actions Cache Hit Skipped Our Tests and Shipped a Regression to 12,000 Users
March 15, 2026 · CI/CD · 9 min read


2:47 PM, Wednesday. CI green, deploy notification in Slack, thumbs-up emoji, business as usual. By 8:30 PM support had 340 tickets about a broken checkout flow. The feature had worked locally. It had worked in staging. CI never failed once. It just never tested the code that broke production.


Production failure: six hours of broken checkout

It started quietly. A handful of support tickets around 3:30 PM: users applying discount codes at checkout were seeing a blank error screen instead of the confirmation page. Everyone was head-down in another sprint and initial triage assumed an edge-case input issue.

By 6 PM ticket volume had spiked. We pulled the error logs. Every checkout attempt with a non-null discount_code field was throwing an uncaught TypeError: Cannot read properties of undefined (reading 'percentOff'). The stack trace pointed straight at the coupon-validation logic shipped in the 2:47 PM deploy.

  • 12,000 users hit the broken checkout
  • 6 hours before the regression was rolled back
  • 340 support tickets filed
  • 6 consecutive green CI runs that skipped the new tests

Rollback was easy. The postmortem question was harder: how did six CI runs pass with zero test failures while shipping an obvious undefined-access bug any unit test would have caught in under a second?


False assumptions: we trusted green without reading it

The most damaging assumption was that green CI means tests ran. We had 14 test suites covering the checkout flow, and three new files specifically for the discount-code feature. The pipeline output said "passed." Nobody ever checked what that meant in numerical terms.

A subtler assumption underneath it: a cache hit means the same work happens faster. Dependency caching is correct and obviously worth it. Restoring 800 MB of node_modules instead of re-downloading saves 90 seconds per run. But we'd extended the same cache to cover compiled test artifacts, and the assumption broke there.

The third assumption is the embarrassing one: that a 4-second test step means fast tests, not absent tests. Our suite normally ran in 47 seconds. On incident day the test step finished in 4 seconds across all six affected deploys. Nobody flagged it, because fast CI gets celebrated, not interrogated.


Investigation: a test step that ran in 4 seconds

The postmortem started by diffing CI run logs. A passing run from two weeks prior vs the incident-day runs:

.github/workflows — test step output comparison
# TWO WEEKS AGO (correct run)
Run jest --coverage --ci
  PASS src/checkout/__tests__/cart.test.ts
  PASS src/checkout/__tests__/pricing.test.ts
  PASS src/checkout/__tests__/discount.test.ts   <-- new file, compiled fresh
  PASS src/checkout/__tests__/coupon-validator.test.ts  <-- new file
  ... (14 suites total)

Test Suites: 14 passed, 14 total
Tests:       203 passed, 203 total
Time:        47.3s

# INCIDENT DAY (cache hit run)
Run jest --coverage --ci
  PASS src/checkout/__tests__/cart.test.ts

Test Suites: 1 passed, 1 total
Tests:       18 passed, 18 total
Time:        3.9s
Exit code: 0

One test suite. Eighteen tests. Exit code zero. The cache had restored a compiled test bundle from a run 11 days earlier, before the discount-code branch was even merged. Jest found no new .test.ts files outside the cached artifact set, ran only the cached bundle, and reported success on 18 tests instead of 203. I remember reading that log and feeling my stomach drop.

Pulling the cache restore log from the Actions run confirmed it:

github actions — cache restore log
Run actions/cache@v3
  with:
    path: .jest-cache
    key: node-test-v1-${{ hashFiles('**/package-lock.json') }}

Cache restored from key: node-test-v1-a3f9d2e8c1b7...
  Created: 2026-03-04T06:22:11Z  (11 days ago)
  Size: 142 MB

The cache key was hashFiles('**/package-lock.json'). The discount-code feature added new test files but no new npm dependencies. So package-lock.json didn't change, the hash matched, the 11-day-old cache came back, and the new test files were invisible to the cached Jest runner.
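The failure mode is easy to reproduce outside CI. hashFiles() is essentially a SHA-256 over the matched files, so a local sketch (hypothetical paths, sha256sum standing in for hashFiles) shows why adding a test file leaves the key untouched:

```shell
# Approximate hashFiles('**/package-lock.json') with sha256sum in a scratch dir
workdir=$(mktemp -d) && cd "$workdir"
echo '{"lockfileVersion": 3}' > package-lock.json

key_before=$(sha256sum package-lock.json | cut -d' ' -f1)

# Add a brand-new test file -- the kind of change that shipped the regression
mkdir -p src/checkout/__tests__
echo '// test stub' > src/checkout/__tests__/discount.test.ts

key_after=$(sha256sum package-lock.json | cut -d' ' -f1)

# Identical hashes: the new test file is invisible to the cache key
if [ "$key_before" = "$key_after" ]; then
  echo "cache key UNCHANGED -> 11-day-old .jest-cache restored"
fi
```

Any change outside the hashed file set produces the same key, so the stale cache keeps winning until a dependency bump happens to rotate it.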


Root cause: cache key scope too narrow

The broken pipeline had a single cache entry covering both node_modules (correct to cache by lockfile) and .jest-cache, which is Jest's compiled transform cache and must invalidate when source files change:

BROKEN: Single Cache Key for node_modules + Jest Transform Cache
══════════════════════════════════════════════════════════════════════

 package-lock.json hash ──────────────────────┐
                                               v
                                   ┌─────────────────────┐
                                   │  Cache Key: v1-a3f9  │
                                   └─────────────────────┘
                                          │        │
                              ┌───────────┘        └──────────┐
                              v                               v
                    node_modules/ (800 MB)         .jest-cache/ (142 MB)
                    [correct: lockfile-bound]     [WRONG: stale transforms]

 New test files added ──> package-lock.json UNCHANGED
                      ──> cache key UNCHANGED
                      ──> .jest-cache restored from 11 days ago
                      ──> new .test.ts files NOT in cache
                      ──> Jest runs only cached suite (18 tests)
                      ──> exits 0 ✓  (lies)

══════════════════════════════════════════════════════════════════════
FIXED: Separate Cache Keys with Correct Scope
══════════════════════════════════════════════════════════════════════

 Cache 1: node_modules
   Key: node-modules-v1-${{ hashFiles('**/package-lock.json') }}
   Invalidates when: dependencies change (correct behavior)

 Cache 2: Jest transform cache
   Key: jest-cache-v1-${{ hashFiles('**/package-lock.json',
                                      'src/**/*.ts',
                                      'src/**/*.tsx') }}
   Invalidates when: deps OR source files change (correct behavior)

 Gate: Test count assertion
   if [ "$TEST_SUITES" -lt 14 ]; then
     echo "ERROR: expected ≥14 test suites, got $TEST_SUITES"
     exit 1
   fi

══════════════════════════════════════════════════════════════════════

The node_modules cache should be bound to the lockfile. That's standard. But the Jest transform cache stores compiled TypeScript and JSX artifacts for each source file. When new source files are added, the transform cache has to be invalidated. Binding it to the lockfile hash meant it only invalidated when dependencies changed, not when application code or tests changed. Which, for a team shipping features faster than dependencies, is basically never.


Architecture fix: separate caches, correct keys, and a count gate

The fix addressed three distinct problems: wrong cache scope, missing source-file invalidation, and no floor on the test count. Corrected workflow below.

.github/workflows/ci.yml — fixed cache strategy
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Cache node_modules
        uses: actions/cache@v4
        with:
          path: node_modules
          key: node-modules-v2-${{ hashFiles('**/package-lock.json') }}
          restore-keys: |
            node-modules-v2-

      - name: Cache Jest transforms
        uses: actions/cache@v4
        with:
          path: .jest-cache
          # Two-part key (lockfile hash, then source hash) so the first
          # restore-key below can prefix-match older caches built on the
          # same dependency set
          key: jest-cache-v2-${{ hashFiles('**/package-lock.json') }}-${{ hashFiles('src/**/*.ts', 'src/**/*.tsx') }}
          restore-keys: |
            jest-cache-v2-${{ hashFiles('**/package-lock.json') }}-
            jest-cache-v2-

      - name: Install dependencies
        run: npm ci --prefer-offline

      - name: Run tests
        run: |
          # pipefail: jest's exit code must fail the step even though tee exits 0
          set -o pipefail
          npx jest --coverage --ci --cacheDirectory=.jest-cache \
            --json --outputFile=jest-results.json 2>&1 | tee jest-output.txt

      - name: Assert minimum test suite count
        run: |
          SUITES=$(jq '.numPassedTestSuites' jest-results.json)
          TESTS=$(jq '.numPassedTests' jest-results.json)
          echo "Test suites passed: $SUITES"
          echo "Tests passed: $TESTS"
          if [ "$SUITES" -lt 14 ]; then
            echo "::error::Expected ≥14 test suites, got $SUITES. Cache may be stale or tests deleted."
            exit 1
          fi
          if [ "$TESTS" -lt 180 ]; then
            echo "::error::Expected ≥180 tests, got $TESTS. Count regressed."
            exit 1
          fi

Three things changed.

First, separate cache entries for separate concerns. node_modules is correctly keyed to the lockfile. The Jest transform cache is keyed to both the lockfile and a glob of all TypeScript source and test files. Adding a new .test.ts file now changes the hash, busts the Jest cache, forces fresh compilation, and ensures the new file is actually found and executed.

Second, restore-keys as a fallback. On a complete cache miss (first run after a major refactor, for example), restore-keys provides a partial match that pre-warms the transform cache with recently compiled artifacts. Jest recompiles only changed files on top of that partial restore. Faster than a cold start, always correct.
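The restore flow is worth spelling out: actions/cache tries an exact match on the primary key first, then tries each restore-key as a prefix against previously saved caches, most recent first. A shell sketch of that lookup, with made-up hash suffixes:

```shell
# Saved caches from earlier runs (newest first), made-up hash suffixes
saved_keys="jest-cache-v2-aaa111 jest-cache-v2-bbb222"
primary="jest-cache-v2-ccc333"   # key after a source change: no exact match
restore_prefix="jest-cache-v2-"  # broadest fallback from restore-keys

hit=""
# 1. Exact match on the primary key
for k in $saved_keys; do
  [ "$k" = "$primary" ] && { hit=$k; break; }
done
# 2. Prefix match via restore-keys -- a partial hit, not a miss
if [ -z "$hit" ]; then
  for k in $saved_keys; do
    case $k in "$restore_prefix"*) hit=$k; break ;; esac
  done
fi
echo "restored: ${hit:-nothing} -> Jest recompiles only what changed"
```

The partial hit is safe precisely because the primary key differs: the step saves a fresh cache under the new key at the end of the run, so the stale artifacts never masquerade as current.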

Third, an explicit test-count floor. The count gate fails the pipeline if fewer than 14 test suites or 180 tests pass. It catches both cache-skipping regressions and accidental test deletion. We bump the threshold in a single commit whenever we add a new test suite. It's lightweight and it's caught two bugs since.
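The gate is simple enough to sanity-check locally against a fabricated results file shaped like Jest's --json output. (The real pipeline uses jq; grep stands in here so the sketch has no dependencies, and only the two fields the gate reads are fabricated.)

```shell
# Incident-day numbers, in the shape of Jest's --json output
cat > jest-results.json <<'EOF'
{"numPassedTestSuites": 1, "numPassedTests": 18}
EOF

# Extract the two counts the gate asserts on
SUITES=$(grep -o '"numPassedTestSuites": *[0-9]*' jest-results.json | grep -oE '[0-9]+$')
TESTS=$(grep -o '"numPassedTests": *[0-9]*' jest-results.json | grep -oE '[0-9]+$')

# Both floors fail on the incident-day numbers
if [ "$SUITES" -lt 14 ] || [ "$TESTS" -lt 180 ]; then
  echo "GATE FAILED: $SUITES suites / $TESTS tests"
fi
```

Run against the incident-day numbers, the gate fails loudly; run against a healthy 14-suite, 203-test result, it passes silently.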

  • 47s → 22s test step time after the fix (partial cache hit)
  • 14 suites minimum gate, enforced in pipeline
  • 0 silent test-skip incidents since the fix
  • 2 bugs caught by the count gate in the following month

Why the 11-day-old cache survived so long

"We had been shipping new features every two to three days without touching package-lock.json. Every single deploy hit the stale cache and silently shed its tests. The cache was getting more dangerous the longer it survived."

In the 11 days between the cache creation and the incident, the team shipped seven features. None touched npm dependencies. Each deploy picked up the 11-day-old .jest-cache, compiled only the changed source files into memory (but not into the cached artifacts), and ran only the tests that were in the restored cache bundle. The regression was effectively undetectable because the tests that would have caught it simply didn't exist in the environment where CI ran.

The problem compounded because the Jest transform cache is a performance artifact, not a test registry. Jest doesn't error when a file is absent from the cache. It treats absence as "nothing to run here" rather than "this file is new, compile it." That's correct behaviour for performance and catastrophic when paired with a stale cache covering the wrong scope.
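A toy model makes the effect concrete. This is not Jest's internals, just the observed behaviour above: treat the restored cache bundle as the effective test registry, and anything outside it never runs and never errors.

```shell
# Toy model: the restored cache bundle acts as the effective test registry
cached_suites="cart.test.ts"
suites_on_disk="cart.test.ts pricing.test.ts discount.test.ts coupon-validator.test.ts"

skipped=0
for f in $suites_on_disk; do
  case " $cached_suites " in
    *" $f "*) echo "RAN     $f" ;;
    *)        echo "SKIPPED $f (absent from cache, no error raised)"
              skipped=$((skipped + 1)) ;;
  esac
done
echo "exit code 0 either way -- $skipped suites silently skipped"
```

The loop exits 0 no matter how many suites fall through, which is exactly the shape of the green runs in the incident logs.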


Lessons learned

  • Cache key scope has to match the artifact's actual dependencies. node_modules depends on the lockfile. The Jest transform cache depends on the lockfile plus the source files. Merging them under a single lockfile-only key silently breaks test discovery the moment new files show up. Audit what each cached artifact actually depends on before writing the key.
  • Add a test-count floor to every pipeline. Exit code 0 means "everything that ran passed." It does not mean "everything that should have run did run." A minimum-count assertion turns silent omissions into hard failures, and it's the single highest-value safeguard we've added.
  • Fast CI steps deserve scrutiny, not celebration. A test step finishing in 4 seconds when it normally takes 47 is a signal that work was skipped, not that the suite got faster. Build timing baselines and alert on anomalous drops.
  • Separate caches for separate concerns. node_modules, compiled test artifacts, build output, and coverage reports each have different invalidation requirements. Bundling them under one key is convenient and wrong the moment any of their dependency sets diverges.
  • Validate CI output structure, not just exit code. We now parse jest-results.json and assert on suite count, test count, and coverage thresholds. Exit code is a binary signal. Structured output is richer. Use the richer one.
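The timing lesson can be enforced mechanically rather than left to vigilance. A minimal sketch of a duration floor, with illustrative numbers (in CI the baseline would come from recent healthy runs and the actual duration from measuring the jest invocation, e.g. with bash's SECONDS):

```shell
# Hypothetical timing guard: flag a test step that finished suspiciously fast
BASELINE_SECONDS=47   # rolling median of recent healthy runs (assumed)
ACTUAL_SECONDS=4      # incident-day value; in CI, measure the jest step
FLOOR=$(( BASELINE_SECONDS / 2 ))

if [ "$ACTUAL_SECONDS" -lt "$FLOOR" ]; then
  echo "::warning::test step took ${ACTUAL_SECONDS}s vs ${BASELINE_SECONDS}s baseline -- work may have been skipped"
fi
```

A warning rather than a hard failure keeps legitimate speedups from blocking deploys while still putting the anomaly in front of a human.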

Green now means something specific in our pipeline: at least 14 test suites ran, at least 180 tests passed, and the cache that served them was invalidated by any change to the files those tests cover. Six hours of broken checkout and 340 tickets later, that's the actual guarantee we walked away with.
