We Set temperature=0 and GPT-4 Still Gave Different Answers — Our Entire CI Pipeline Broke

March 18, 2026 · AI · 10 min read
The CI run passed at 9:14 AM. The identical commit, re-run 40 minutes later, failed. No code changed, no dependencies updated, diff empty. Our automated code review step, powered by GPT-4 with temperature=0, had switched from "APPROVED" to "CHANGES_REQUESTED" between runs.

We'd spent three weeks building a pipeline on the assumption that temperature=0 meant deterministic output. It doesn't. It never did. We'd just been lucky.

What we built: an LLM-powered code review gate

Our code review pipeline used GPT-4 to enforce an internal standard we called the "API contract checklist." Every PR that touched our public API surface ran a GitHub Actions job that sent each changed route's controller code to GPT-4 and asked it to verify 12 specific requirements: error response shape, pagination format, auth header handling, rate limit headers, and so on.

The output was a structured JSON verdict: one PASS or FAIL per check, with a reason attached to each failure. A PR couldn't merge if any check failed. It had been running for six weeks and had caught 34 real issues that human reviewers missed. We were proud of it.

  THE PIPELINE
  ─────────────────────────────────────────────────────────────
  
  PR opened → GitHub Actions triggered
       │
       ▼
  Changed API controllers extracted (git diff)
       │
       ▼
  For each controller:
    GPT-4 (temperature=0, gpt-4-turbo-preview)
    + "Here are the 12 API contract requirements"  
    + "Here is the controller code"
    + "Return JSON: { check_id, result: PASS|FAIL, reason }"
       │
       ▼
  Results aggregated → PR status check set
  (All PASS → green, any FAIL → red)
  
  Assumption: temperature=0 = deterministic = testable
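
In code, the per-controller call looked roughly like this. This is a sketch, not our exact implementation: `buildReviewMessages`, `REQUIREMENTS`, and the check wording are illustrative stand-ins.

```javascript
// Hypothetical reconstruction of the review call. The one real
// assumption we made is visible on one line: temperature: 0.
const REQUIREMENTS = [
  { id: 'check_7', text: 'Paginated responses use { data, next_cursor } format' },
  // ...the other 11 contract checks
];

function buildReviewMessages(controllerSource, requirements) {
  return [
    { role: 'system', content: 'You are an API contract reviewer. Respond with JSON only.' },
    {
      role: 'user',
      content: [
        'Here are the API contract requirements:',
        ...requirements.map(r => `- ${r.id}: ${r.text}`),
        'Here is the controller code:',
        '```\n' + controllerSource + '\n```',
        'Return JSON: { "checks": [{ "check_id", "result": "PASS" | "FAIL", "reason" }] }',
      ].join('\n'),
    },
  ];
}

async function reviewCode(openai, controllerSource, requirements) {
  const response = await openai.chat.completions.create({
    model: 'gpt-4-turbo-preview',
    temperature: 0, // greedy decoding — NOT, as we learned, a determinism guarantee
    response_format: { type: 'json_object' },
    messages: buildReviewMessages(controllerSource, requirements),
  });
  return JSON.parse(response.choices[0].message.content);
}
```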

The flipping starts

The first sign was a Slack message from a developer on a Friday: "My PR passed CI twice and failed once on the exact same commit. Did someone change the review prompt?" Nobody had. We looked at the three run logs side by side. Runs 1 and 3 showed Check #7 (pagination format) as PASS. Run 2 showed it as FAIL, with a reason that was technically correct but contradicted the logic the model had used to pass it in runs 1 and 3.

We assumed it was a transient issue, a model serving blip or a load balancing artifact, and moved on. Over the next two weeks the flipping became more frequent. By the end of the second week, we had four developers who'd learned to just re-run the CI job if it failed on the API review step, because it would usually pass the second time.

That's when we knew we had a real problem. Our engineers had started treating an automated quality gate as a coin flip.

What temperature=0 Actually Guarantees

When you set temperature=0, the model uses greedy decoding. At each token position, it selects the highest-probability next token. In theory, given identical inputs and an identical model, this produces identical outputs. The key phrase is identical model.
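
Stripped of everything else, greedy decoding is just an argmax over the next-token distribution. This toy function (not the real model, obviously) makes the failure mode concrete: the selection itself has no randomness, so any change to the numbers flips the output.

```javascript
// Toy illustration of greedy decoding: pick the highest-probability token.
function greedyPick(distribution) {
  // distribution: { token: probability }
  return Object.entries(distribution)
    .reduce((best, cur) => (cur[1] > best[1] ? cur : best))[0];
}

// Identical distribution → identical pick, every single time...
greedyPick({ PASS: 0.71823419, FAIL: 0.28176581 }); // → 'PASS'

// ...but if a model update (or floating-point noise) nudges the
// probabilities, the argmax flips — with no randomness involved.
greedyPick({ PASS: 0.49, FAIL: 0.51 }); // → 'FAIL'
```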

OpenAI updates the models behind their API endpoints continuously. When you call gpt-4-turbo-preview, you're not calling a frozen model, you're calling whatever version OpenAI is currently serving under that alias. Model updates change the weights, which change the probability distributions, which change the greedy decoding output. Same prompt, same temperature=0, different model snapshot, different answer.

There's a second source of non-determinism that persists even with a pinned model version: floating-point non-determinism from GPU parallel computation. Transformer inference runs on GPUs with parallel matrix operations. The order of floating-point additions in parallel isn't deterministic at the hardware level, and floating-point arithmetic isn't associative. Two identical requests processed on different GPU hardware configurations can produce slightly different intermediate values, which can cascade into different token selections at positions where two tokens have nearly equal probability. (I had to look this up to verify it, because I didn't actually believe it at first.)

  WHY temperature=0 DOESN'T MEAN DETERMINISTIC
  ─────────────────────────────────────────────────────────────
  
  Source 1: Model updates
  
  Week 1: gpt-4-turbo-preview → model checkpoint A
               │
               │  OpenAI silent update
               ▼
  Week 3: gpt-4-turbo-preview → model checkpoint B
  
  Same prompt → different weight distributions → different output
  
  ─────────────────────────────────────────────────────────────
  
  Source 2: GPU floating-point non-determinism
  
  Token N probabilities:
  Token "PASS":  0.71823419...
  Token "FAIL":  0.71823418...  ← difference: ~0.00000001
  
  On GPU cluster A: PASS wins (FP addition order favours PASS)
  On GPU cluster B: FAIL wins (different FP addition order)
  
  This is not a bug. This is how IEEE 754 floating-point works
  in parallel computation.
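
You can see the non-associativity on any machine, no GPU required. This is the classic one-liner: summing the same numbers in a different order gives a different result, which is exactly what parallel reductions on a GPU do at scale.

```javascript
// IEEE 754 addition is not associative. Same three numbers,
// different grouping, different result.
const a = 0.1, b = 0.2, c = 0.3;
const leftToRight = (a + b) + c;   // 0.6000000000000001
const rightToLeft = a + (b + c);   // 0.6
console.log(leftToRight === rightToLeft); // false
```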

Finding the model version change

We added model fingerprinting to every API call, logging the system_fingerprint field that OpenAI returns in completions responses. This field changes when the underlying model is updated. When we reviewed our logs, we found that gpt-4-turbo-preview's system fingerprint had changed on the exact date our flipping rate spiked from around 0.3% to around 8% of runs.

// We should have been logging this from day one
const response = await openai.chat.completions.create({ ... });

logger.info('llm_call', {
  model: response.model,
  system_fingerprint: response.system_fingerprint,  // log this always
  prompt_tokens: response.usage?.prompt_tokens,
  completion_tokens: response.usage?.completion_tokens,
  run_id: context.runId,
});
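
A cheap follow-up you can build on top of that log line is an explicit change detector: compare each run's fingerprint against the last one seen and flag silent model updates. A hypothetical sketch, with an in-memory variable standing in for whatever state your CI runner actually persists:

```javascript
// Detect silent model updates by tracking the last-seen fingerprint.
// In a real pipeline this state would live in a database or cache,
// not a module-level variable.
let lastFingerprint = null;

function fingerprintChanged(current) {
  const changed = lastFingerprint !== null && current !== lastFingerprint;
  const previous = lastFingerprint;
  lastFingerprint = current;
  // When changed is true, the model behind the alias has been swapped —
  // expect output drift from this point on.
  return changed ? { changed, previous, current } : { changed };
}
```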

The fingerprint change explained most of the flipping, but not all of it. Even after the model stabilised at the new version, we still saw occasional inconsistency on specific inputs. That was the GPU floating-point issue affecting tokens with near-identical probabilities.

The fix: stop treating LLMs as oracles, start treating them as voters

The architectural fix was to stop relying on a single LLM call and use a majority vote across multiple calls. This is sometimes called self-consistency.

// Before: single call, single verdict
const verdict = await reviewCode(controller, requirements);
if (verdict.failures.length > 0) fail();

// After: 3 independent calls, majority vote
const verdicts = await Promise.all([
  reviewCode(controller, requirements),
  reviewCode(controller, requirements),
  reviewCode(controller, requirements),
]);

// For each check, count votes
// (reviewCode returns a map of check_id → 'PASS' | 'FAIL')
const finalVerdict = requirements.map(check => {
  const votes = verdicts.map(v => v[check.id]);
  const failVotes = votes.filter(v => v === 'FAIL').length;
  
  // Require 2/3 agreement to FAIL a check
  // A single dissenting vote is not enough to block a PR
  return {
    check_id: check.id,
    result: failVotes >= 2 ? 'FAIL' : 'PASS',
    confidence: failVotes === 3 ? 'high' : failVotes === 2 ? 'medium' : 'low',
  };
});

We also pinned to a dated model snapshot instead of the rolling alias.

// Before (rolling alias — updates silently)
model: 'gpt-4-turbo-preview'

// After (pinned snapshot — stable until deprecated)
model: 'gpt-4-0125-preview'

// And we created a monthly task to review and update the pin
// after testing the new snapshot against our evaluation set

The 3-call majority vote added about $0.004 per PR check and increased latency from 4s to 7s (parallel calls). Flipping rate dropped to zero over the following 4 weeks of monitoring.

Building an evaluation set

The deeper fix was building a golden evaluation set: 200 controller examples with known correct verdicts that we run against any new model snapshot before updating the pin. That lets us catch regressions before they affect the live pipeline.

// evaluate.ts — run before updating the model pin
const GOLDEN_SET = await loadGoldenSet(); // 200 labelled examples

// Structural equality via serialisation — fine here because our
// verdict objects have a stable key order
const deepEqual = (a, b) => JSON.stringify(a) === JSON.stringify(b);

const results = [];
for (const { controller, expectedVerdict } of GOLDEN_SET) {
  // Sequential on purpose: 200 parallel calls would hit rate limits
  const verdict = await reviewCode(controller, requirements);
  results.push({ expected: expectedVerdict, actual: verdict, match: deepEqual(verdict, expectedVerdict) });
}

const accuracy = results.filter(r => r.match).length / results.length;
console.log(`Model accuracy on golden set: ${(accuracy * 100).toFixed(1)}%`);
if (accuracy < 0.95) {
  throw new Error('Model does not meet accuracy threshold. Do not update pin.');
}

Lessons

temperature=0 is not determinism, it's greedy decoding. Greedy decoding on a non-frozen model is not reproducible. Never build a pipeline that requires identical LLM output across runs.

Never use rolling model aliases in production. gpt-4-turbo-preview, gpt-4o, claude-3-5-sonnet-latest, all of them can change under you without notice. Pin to dated snapshots. Test before updating.

Log system_fingerprint on every call. It's the only way to know if the model behind your API call changed between two runs.

For high-stakes decisions, use self-consistency (majority vote). Three independent calls with a 2/3 threshold is more reliable than one call with any temperature setting.

If engineers start re-running CI to get a different answer, your pipeline is broken. Non-determinism that developers learn to work around doesn't show up in your error metrics. It shows up in lost trust.
