We Asked GPT-4 to Review Our Pull Requests for 30 Days — It Approved the Bug That Took Down Prod
The security audit report arrived on a Tuesday. Line 4: "23 admin API routes accessible without valid authentication token." We had a GPT-4 code review gate in CI that had been running for 30 days. 847 PRs reviewed, 61 real issues flagged. It had specifically approved the PR that introduced this bug, with the comment: "Auth check looks correct, permission validation is present."
The permission validation was present. It just never ran.
The setup: 30 days of automated code review
Six weeks before the audit, we built an LLM-powered review gate to enforce our internal security checklist. Every PR that touched API route handlers triggered a GitHub Actions job. It extracted the changed files, sent them to GPT-4 with a structured prompt covering 14 security checks, and posted the results as a required status check. No APPROVED verdict, no merge.
The prompt checked for missing auth middleware, improper permission validation, exposed internal
fields in responses, and missing rate limiting. We had tuned it over two weeks with real examples and it was
genuinely good. In the first month it caught an IDOR vulnerability, three missing auth guards on
new routes, and a role check that used == instead of ===. We trusted it.
Probably too much, in hindsight, but at the time it felt like having a tireless junior reviewer who
never forgot the checklist.
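That == role check deserves a second look. A reconstructed example (ours, not the code that was actually flagged) of why loose equality is dangerous in a permission check:

```typescript
// Hypothetical role scheme: admins are role level 0.
const ROLE_ADMIN = 0;

// Loose equality coerces types: '' == 0 and '0' == 0 are both true,
// so an empty or stringly-typed role level can slip past the check.
function isAdminLoose(roleLevel: any): boolean {
  return roleLevel == ROLE_ADMIN;
}

// Strict equality compares type and value: only the number 0 passes.
function isAdminStrict(roleLevel: any): boolean {
  return roleLevel === ROLE_ADMIN;
}
```

An empty string for the role, which a buggy parser can easily produce, makes isAdminLoose return true while isAdminStrict correctly rejects it.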
THE REVIEW PIPELINE
──────────────────────────────────────────────────────────────
PR opened (touches /api/** routes)
│
▼
GitHub Actions: extract changed files from diff
│
▼
GPT-4 (gpt-4-turbo, temp=0.1)
┌─────────────────────────────────────┐
│ "Review these files against our │
│ 14-point security checklist. │
│ Return JSON verdict per check." │
└─────────────────────────────────────┘
│
▼
Result: { check: "auth_guard", verdict: "PASS", reason: "..." }
│
All 14 PASS → ✅ Merge allowed
Any FAIL → ❌ PR blocked
The PR that slipped through
The PR looked routine. A junior engineer was refactoring our admin route middleware to support a new
role hierarchy. The diff showed clean TypeScript: a middleware function that checked
user.hasPermission('admin') before passing to the route handler. GPT-4 reviewed it and
returned all 14 checks as PASS. Check #3, auth guard, read:
"Permission validation is present via hasPermission() call. Auth check looks correct."
The check was present. But it was never reached.
// admin-middleware.ts (what was in the diff)
import { requirePermission } from './auth-helpers';
import { devUserFactory } from '../../test/factories/user-factory'; // ← this line

export const adminGuard = async (req: Request, res: Response, next: NextFunction) => {
  const user = req.user ?? devUserFactory.createAdmin(); // fallback for "dev mode"
  if (user.hasPermission('admin')) {
    return next();
  }
  return res.status(403).json({ error: 'Forbidden' });
};
The bug is the combination of the devUserFactory import and the req.user fallback. The
devUserFactory import was a test helper that had been used during local development and never
removed. When req.user was undefined (which it always is for unauthenticated requests), the
middleware fell back to devUserFactory.createAdmin(), which returned a mock admin user object.
That object passed hasPermission('admin') every time. Unauthenticated requests were being handed
a mock admin identity.
GPT-4 saw the diff. The diff showed a permission check. The permission check existed. PASS.
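The failure is easy to reproduce in isolation. A minimal sketch of the fallback's behavior, with stand-in types in place of Express and the real factory:

```typescript
// Stand-in for the real user type (illustrative only).
interface User {
  hasPermission(perm: string): boolean;
}

// The test helper: a mock whose permission check always passes.
const devUserFactory = {
  createAdmin: (): User => ({ hasPermission: () => true }),
};

// The vulnerable logic, isolated: for an unauthenticated request,
// req.user is undefined, so ?? silently hands back a mock admin.
function resolveUser(reqUser: User | undefined): User {
  return reqUser ?? devUserFactory.createAdmin();
}
```

Call resolveUser(undefined) and the mock comes back, and hasPermission('admin') on it returns true, which is exactly the identity the middleware then trusted.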
Why GPT-4 could not catch it
This was a failure of what we gave the model, not of the model itself.
The bug required understanding three things at once:
- devUserFactory lives in test/factories/user-factory.ts, not in the diff.
- The factory's createAdmin() method returns a fully-permissioned mock. Also not in the diff.
- req.user is undefined for unauthenticated requests. That's defined in Express type extensions. Not in the diff either.
GPT-4 reviewed the diff. The diff was three lines. The vulnerability lived in the intersection of three files across two directories, none of which changed in this PR.
WHAT GPT-4 SAW WHAT GPT-4 NEEDED TO SEE
───────────────────── ──────────────────────────────────────
admin-middleware.ts ✅ admin-middleware.ts ✅
(3 lines changed) test/factories/user-factory.ts ❌
types/express/index.d.ts ❌
(req.user type — undefined when unauthed)
GPT-4 context: diff Required context: full import graph
Result: PASS Correct result: FAIL — mock leaking to prod
11 days undetected
The PR merged on a Friday afternoon. The next 11 days were normal. No spike in admin API traffic, no unusual error rates, no alerts. The routes were sitting there wide open and nobody happened to probe them systematically until our quarterly security audit ran an automated scanner against staging (which mirrored prod).
The scanner hit GET /api/admin/users without an auth header. Got a 200 with a full user list.
Hit all 23 admin routes. Got 200s on all of them. The report flagged every single one.
We rolled back the middleware immediately. Total exposure window: 11 days on staging, confirmed same on production after log analysis. No evidence of exploitation in access logs, but 11 days is 11 days.
The fix: context-aware review + hard rules
We made two changes to the review pipeline.
First, we expanded what we send GPT-4. Instead of just the diff, we now include the full content
of every file imported by the changed files, one level deep. For the middleware above, that means
auth-helpers.ts, user-factory.ts, and the Express type extensions all go
into context. Token cost went up. Accuracy went up more.
// Before: just the diff
const reviewContext = getDiffFiles(pr);

// After: diff + one level of imports
const reviewContext = [
  ...getDiffFiles(pr),
  ...getImportedFiles(getDiffFiles(pr), { depth: 1, excludeNodeModules: true })
];

// Cap at ~60K tokens to stay within the context window
const truncated = truncateToTokenBudget(reviewContext, 60_000);
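The interesting part of getImportedFiles is walking one level of the import graph. A rough, regex-based sketch of the path extraction (our illustration, not the production helper; it catches static imports and re-exports but misses dynamic import() calls):

```typescript
// Pull relative import specifiers out of a TypeScript source file.
// Matches static `import ... from '...'` and `export ... from '...'`
// forms whose specifier starts with '.', i.e. project-local files.
const IMPORT_RE = /(?:import|export)[^'"]*from\s+['"](\.[^'"]+)['"]/g;

function extractImportPaths(source: string): string[] {
  const paths: string[] = [];
  for (const match of source.matchAll(IMPORT_RE)) {
    paths.push(match[1]);
  }
  return paths;
}
```

Resolving each extracted path to a file on disk and reading its content gives the one-level-deep context set; node_modules specifiers never match because they don't start with a dot.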
Second, we added a hard rule: no automated reviewer (LLM or otherwise) can approve production code that imports from test/, __mocks__/, fixtures/, or factories/ paths. That pattern is always a FAIL, no model judgment required.
const TEST_IMPORT_PATTERN = /from ['"][^'"]*\/(test|__mocks__|fixtures|factories)\//;

interface ReviewFile {
  path: string;
  content: string;
}

function preflightChecks(files: ReviewFile[]): ReviewViolation[] {
  const violations: ReviewViolation[] = [];
  for (const file of files) {
    if (TEST_IMPORT_PATTERN.test(file.content)) {
      violations.push({
        file: file.path,
        rule: 'no-test-imports-in-prod',
        severity: 'critical',
        message: 'Production file imports from test directory — automatic FAIL, do not merge.'
      });
    }
  }
  return violations;
}
Lessons
After 30 days and 847 reviews, here's what actually stuck.
LLMs review what you give them, not what you mean. A diff is a fragment. Security vulnerabilities often live in the intersection of multiple files, and if you only give the model the fragment, it will only review the fragment. Confidently, and sometimes incorrectly.
For security-critical paths we now treat GPT-4 as an advisor, not a gatekeeper. It's exceptional at catching common patterns it has seen before: missing null checks, SQL injection shapes, obvious IDOR patterns. It's unreliable for contextual reasoning across file boundaries. So it runs as a second opinion alongside human review, not as a replacement.
And for the patterns that matter most, we just wrote code. The test-import-in-production pattern isn't a judgment call, it's always wrong. Regex catches it in milliseconds with zero false negatives. Save the LLM for the ambiguous cases where human-like reasoning actually helps.