We Asked GPT-4 to Review Our Pull Requests for 30 Days — It Approved the Bug That Took Down Prod
The security audit report arrived on a Tuesday. Line 4: "23 admin API routes accessible without valid authentication token." We had a GPT-4 code review gate in CI that had been running for 30 days. 847 PRs reviewed, 61 real issues flagged. It had specifically approved the PR that introduced this bug, with the comment: "Auth check looks correct, permission validation is present."
The permission validation was present. It just never ran.
The setup: 30 days of automated code review
Six weeks before the audit, we built an LLM-powered review gate to enforce our internal security checklist. Every PR that touched API route handlers triggered a GitHub Actions job. It extracted the changed files, sent them to GPT-4 with a structured prompt covering 14 security checks, and posted the results as a required status check. No APPROVED verdict, no merge.
The prompt checked for missing auth middleware, improper permission validation, exposed internal
fields in responses, and missing rate limiting. We had tuned it over two weeks with real examples and it was
genuinely good. In the first month it caught an IDOR vulnerability, three missing auth guards on
new routes, and a role check that used == instead of ===. We trusted it.
Probably too much, in hindsight, but at the time it felt like having a tireless junior reviewer who
never forgot the checklist.
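That == role check deserves a second look. A reconstructed example (ours, not the code that was actually flagged) of why loose equality is dangerous in a permission check:

```typescript
// Hypothetical role scheme: admins are role level 0.
const ROLE_ADMIN = 0;

// Loose equality coerces types: '' == 0 and '0' == 0 are both true,
// so an empty or stringly-typed role level can slip past the check.
function isAdminLoose(roleLevel: any): boolean {
  return roleLevel == ROLE_ADMIN;
}

// Strict equality compares type and value: only the number 0 passes.
function isAdminStrict(roleLevel: any): boolean {
  return roleLevel === ROLE_ADMIN;
}
```

An empty string for the role, which a buggy parser can easily produce, makes isAdminLoose return true while isAdminStrict correctly rejects it.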
THE REVIEW PIPELINE
──────────────────────────────────────────────────────────────
PR opened (touches /api/** routes)
│
▼
GitHub Actions: extract changed files from diff
│
▼
GPT-4 (gpt-4-turbo, temp=0.1)
┌─────────────────────────────────────┐
│ "Review these files against our │
│ 14-point security checklist. │
│ Return JSON verdict per check." │
└─────────────────────────────────────┘
│
▼
Result: { check: "auth_guard", verdict: "PASS", reason: "..." }
│
All 14 PASS → ✅ Merge allowed
Any FAIL → ❌ PR blocked
The PR that slipped through
The PR looked routine. A junior engineer was refactoring our admin route middleware to support a new
role hierarchy. The diff showed clean TypeScript: a middleware function that checked
user.hasPermission('admin') before passing to the route handler. GPT-4 reviewed it and
returned all 14 checks as PASS. Check #3, auth guard, read:
"Permission validation is present via hasPermission() call. Auth check looks correct."
The check was present. But it was never reached.
// admin-middleware.ts (what was in the diff)
import { requirePermission } from './auth-helpers';
import { devUserFactory } from '../../test/factories/user-factory'; // ← this line

export const adminGuard = async (req: Request, res: Response, next: NextFunction) => {
  const user = req.user ?? devUserFactory.createAdmin(); // fallback for "dev mode"
  if (user.hasPermission('admin')) {
    return next();
  }
  return res.status(403).json({ error: 'Forbidden' });
};
The bug is the combination of the devUserFactory import and the req.user fallback. The
devUserFactory import was a test helper that had been used during local development and never
removed. When req.user was undefined (which it always is for unauthenticated requests), the
middleware fell back to devUserFactory.createAdmin(), which returned a mock admin user object.
That object passed hasPermission('admin') every time. Unauthenticated requests were being handed
a mock admin identity.
GPT-4 saw the diff. The diff showed a permission check. The permission check existed. PASS.
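The failure is easy to reproduce in isolation. A minimal sketch of the fallback's behavior, with stand-in types in place of Express and the real factory:

```typescript
// Stand-in for the real user type (illustrative only).
interface User {
  hasPermission(perm: string): boolean;
}

// The test helper: a mock whose permission check always passes.
const devUserFactory = {
  createAdmin: (): User => ({ hasPermission: () => true }),
};

// The vulnerable logic, isolated: for an unauthenticated request,
// req.user is undefined, so ?? silently hands back a mock admin.
function resolveUser(reqUser: User | undefined): User {
  return reqUser ?? devUserFactory.createAdmin();
}
```

Call resolveUser(undefined) and the mock comes back, and hasPermission('admin') on it returns true, which is exactly the identity the middleware then trusted.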
Why GPT-4 could not catch it
This was a failure of what we gave the model, not of the model itself.
The bug required understanding three things at once:
- devUserFactory lives in test/factories/user-factory.ts, not in the diff.
- The factory's createAdmin() method returns a fully-permissioned mock. Also not in the diff.
- req.user is undefined for unauthenticated requests. That's defined in Express type extensions. Not in the diff either.
GPT-4 reviewed the diff. The diff was three lines. The vulnerability lived in the intersection of three files across two directories, none of which changed in this PR.
WHAT GPT-4 SAW WHAT GPT-4 NEEDED TO SEE
───────────────────── ──────────────────────────────────────
admin-middleware.ts ✅ admin-middleware.ts ✅
(3 lines changed) test/factories/user-factory.ts ❌
types/express/index.d.ts ❌
(req.user type — undefined when unauthed)
GPT-4 context: diff Required context: full import graph
Result: PASS Correct result: FAIL — mock leaking to prod
11 days undetected
The PR merged on a Friday afternoon. The next 11 days were normal. No spike in admin API traffic, no unusual error rates, no alerts. The routes were sitting there wide open and nobody happened to probe them systematically until our quarterly security audit ran an automated scanner against staging (which mirrored prod).
The scanner hit GET /api/admin/users without an auth header. Got a 200 with a full user list.
Hit all 23 admin routes. Got 200s on all of them. The report flagged every single one.
We rolled back the middleware immediately. Total exposure window: 11 days on staging, confirmed same on production after log analysis. No evidence of exploitation in access logs, but 11 days is 11 days.
The fix: context-aware review + hard rules
We made two changes to the review pipeline.
First, we expanded what we send GPT-4. Instead of just the diff, we now include the full content
of every file imported by the changed files, one level deep. For the middleware above, that means
auth-helpers.ts, user-factory.ts, and the Express type extensions all go
into context. Token cost went up. Accuracy went up more.
// Before: just the diff
const reviewContext = getDiffFiles(pr);

// After: diff + one level of imports
const reviewContext = [
  ...getDiffFiles(pr),
  ...getImportedFiles(getDiffFiles(pr), { depth: 1, excludeNodeModules: true })
];

// Cap at ~60K tokens to stay within the context window
const truncated = truncateToTokenBudget(reviewContext, 60_000);
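The interesting part of getImportedFiles is walking one level of the import graph. A rough, regex-based sketch of the path extraction (our illustration, not the production helper; it catches static imports and re-exports but misses dynamic import() calls):

```typescript
// Pull relative import specifiers out of a TypeScript source file.
// Matches static `import ... from '...'` and `export ... from '...'`
// forms whose specifier starts with '.', i.e. project-local files.
const IMPORT_RE = /(?:import|export)[^'"]*from\s+['"](\.[^'"]+)['"]/g;

function extractImportPaths(source: string): string[] {
  const paths: string[] = [];
  for (const match of source.matchAll(IMPORT_RE)) {
    paths.push(match[1]);
  }
  return paths;
}
```

Resolving each extracted path to a file on disk and reading its content gives the one-level-deep context set; node_modules specifiers never match because they don't start with a dot.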
Second, we added a hard rule: no automated reviewer (LLM or otherwise) can approve production code that imports from test/, __mocks__/, fixtures/, or factories/ paths. That pattern is always a FAIL, no model judgment required.
const TEST_IMPORT_PATTERN = /from ['"][^'"]*\/(test|__mocks__|fixtures|factories)\//;

interface ReviewFile {
  path: string;
  content: string;
}

function preflightChecks(files: ReviewFile[]): ReviewViolation[] {
  const violations: ReviewViolation[] = [];
  for (const file of files) {
    if (TEST_IMPORT_PATTERN.test(file.content)) {
      violations.push({
        file: file.path,
        rule: 'no-test-imports-in-prod',
        severity: 'critical',
        message: 'Production file imports from test directory — automatic FAIL, do not merge.'
      });
    }
  }
  return violations;
}
Lessons
After 30 days and 847 reviews, here's what actually stuck.
LLMs review what you give them, not what you mean. A diff is a fragment. Security vulnerabilities often live in the intersection of multiple files, and if you only give the model the fragment, it will only review the fragment. Confidently, and sometimes incorrectly.
For security-critical paths we now treat GPT-4 as an advisor, not a gatekeeper. It's exceptional at catching common patterns it has seen before: missing null checks, SQL injection shapes, obvious IDOR patterns. It's unreliable for contextual reasoning across file boundaries. So it runs as a second opinion alongside human review, not as a replacement.
And for the patterns that matter most, we just wrote code. The test-import-in-production pattern isn't a judgment call, it's always wrong. Regex catches it in milliseconds with zero false negatives. Save the LLM for the ambiguous cases where human-like reasoning actually helps.