Our OpenAI Bill Went From $23 to $4,200 in 48 Hours — A Missing Stop Sequence Did It
Tuesday after a long weekend. Our Head of Engineering opens the OpenAI billing dashboard and assumes the page hasn't loaded properly. Previous month: $23. Current month, running total: $4,218.40. All of it accumulated over 48 hours while the office was empty.
14,000 API calls. 39.9 million tokens. The model wasn't doing anything useful with any of them. It was stuck in a retry loop, regenerating the same doomed output over and over, indefinitely, at $0.03 per 1,000 output tokens. I found out when I came in Tuesday morning and saw the Head of Engineering standing very still at her desk.
The pipeline: feedback categorisation at scale
We'd built a background pipeline to process user feedback submissions (bug reports, feature requests, NPS responses) and categorise them, pull out action items, and route them to the right team channel. GPT-4 Turbo via the OpenAI API, running as a queue consumer on SQS. It had been working fine for two months.
The prompt was roughly:
const prompt = `
You are a product feedback analyst. Given the following user feedback,
output a JSON object with:
- category: one of [bug, feature, question, complaint, praise]
- severity: one of [critical, high, medium, low]
- summary: a 1-2 sentence summary
- actionItems: array of specific action items for the product team
- sentiment: score from -1.0 to 1.0
User feedback:
${feedbackText}
Respond with only the JSON object, no markdown fencing.
`;
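For reference, this is the shape we expected back, written out as a TypeScript type (field names come from the prompt above; the type itself is a reconstruction, not code we shipped):

```typescript
// Expected payload shape, reconstructed from the prompt's field list.
interface FeedbackAnalysis {
  category: 'bug' | 'feature' | 'question' | 'complaint' | 'praise';
  severity: 'critical' | 'high' | 'medium' | 'low';
  summary: string;        // 1-2 sentence summary
  actionItems: string[];  // specific action items for the product team
  sentiment: number;      // score from -1.0 to 1.0
}
```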
For two months, this worked perfectly. The JSON came back clean, parsed correctly, routed correctly. Then we shipped a change.
The change that broke everything
A PM asked for one addition: a suggestedResponse field, a draft reply we
could send back to the user. I updated the prompt, tested it manually with 10 feedback
samples, all 10 worked. Deployed Friday afternoon. (You can already see where this is
going.)
The new field was described as: "suggestedResponse: a friendly, empathetic response to send to the user, acknowledging their feedback and describing next steps."
Here's what I missed. For certain long, emotional feedback submissions, particularly
NPS detractor rants that ran to several paragraphs, the model would produce a long
suggestedResponse that itself contained quoted user text and {placeholder}-style
template syntax. Our extraction step scanned for the closing } of the JSON by
counting braces, without tracking whether it was inside a string value, so a stray
brace in the suggestedResponse text threw the count off and it sliced out a
truncated, invalid fragment. JSON.parse fails. Message goes back to the queue.
Consumer picks it up. GPT-4 gets called again. And again.
THE LOOP
─────────────────────────────────────────────────────────────
Long feedback message arrives in SQS
│
▼
GPT-4 called → generates suggestedResponse with {template} syntax
│
▼
JSON.parse() on the extracted fragment throws SyntaxError
(brace-counting extractor truncated at a stray } in suggestedResponse)
│
▼
Error handler: message returns to queue (visibility timeout: 30s)
│
▼
Consumer picks it up again after 30s
│
└──── back to top ──── (repeats forever)
SQS maxReceiveCount: not in effect (no valid redrive policy → unlimited redelivery)
Dead letter queue: configured but wrong ARN (never received)
Alert threshold: $500 spend (never reached before weekend started)
14,000 API calls × avg 2,850 tokens each = 39.9M tokens
Cost: $4,218.40 over 48 hours
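To make the failure concrete, here's a minimal sketch of the kind of brace-counting extractor we had in front of JSON.parse (the name and details are illustrative, not our exact code). It truncates as soon as a stray } appears inside a string value:

```typescript
// Naive extractor: counts braces without tracking string context.
// A lone '}' inside a string value makes depth hit zero early, so the
// returned fragment ends mid-string and JSON.parse rejects it.
function extractJson(raw: string): string {
  const start = raw.indexOf('{');
  if (start === -1) throw new Error('no JSON object found');
  let depth = 0;
  for (let i = start; i < raw.length; i++) {
    if (raw[i] === '{') depth++;
    else if (raw[i] === '}') depth--; // assumes braces only ever delimit JSON
    if (depth === 0) return raw.slice(start, i + 1);
  }
  throw new Error('unbalanced braces');
}
```

On a clean response it behaves; with a stray closing brace in suggestedResponse it hands JSON.parse an unparseable fragment, which is exactly what sent messages back to the queue.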
Why our safeguards all failed simultaneously
We had three things in place that should have caught this. None of them did.
The SQS dead letter queue was configured. But when we'd migrated queue infrastructure two months earlier, the DLQ ARN in our Terraform config still pointed to the old environment. Nobody had noticed because the happy path kept working, and we'd never deliberately failed a message to verify delivery.
Spend alerts were set in CloudWatch at $500. The spend started Friday evening, but AWS billing alerts lag by 6 to 12 hours because of usage aggregation, so the alert didn't fire until Sunday afternoon, at $3,847, in the middle of a long weekend when nobody was checking email.
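Given that lag, one mitigation worth having (sketched here; the class and limits are illustrative, not our production code) is an in-process budget guard: the consumer always knows how many tokens it just paid for, so it can refuse further API calls the moment estimated spend crosses a ceiling, hours before any billing alert fires.

```typescript
// Illustrative in-process spend ceiling, independent of billing-alert lag.
class SpendGuard {
  private spentUsd = 0;
  constructor(private readonly limitUsd: number) {}

  // Record the cost of a completed call from its token usage.
  record(tokens: number, usdPer1kTokens: number): void {
    this.spentUsd += (tokens / 1000) * usdPer1kTokens;
  }

  // Check before each API call; halt the consumer when this returns false.
  allow(): boolean {
    return this.spentUsd < this.limitUsd;
  }
}
```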
We also tracked the error rate of the feedback pipeline. The consumer was catching the
JSON.parse exception internally and returning the message to the queue, so
it never surfaced as an application error. From the metrics' perspective, the consumer
was perfectly healthy: receiving messages, processing them, no uncaught exceptions.
Just lighting money on fire in the background.
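The fix for that blind spot is to make the parse failure an explicit, counted event instead of a silent requeue. A sketch (the Metrics interface is a stand-in for whatever metrics client you actually use):

```typescript
// Count parse failures as first-class errors so dashboards can alert on them.
type Metrics = { increment: (name: string) => void };

function parseModelOutput(raw: string, metrics: Metrics): unknown | null {
  try {
    return JSON.parse(raw);
  } catch {
    metrics.increment('feedback_pipeline.parse_failures'); // now visible in alerting
    return null; // caller decides: retry with a ceiling, or dead-letter
  }
}
```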
The two root causes
When we did the post-mortem, we identified two independent root causes, either of which alone would have prevented the incident:
First: no stop sequence, no max_tokens. The API call had neither, so the
model could generate arbitrarily long output. For the inputs that triggered the loop,
each call was generating around 4,200 tokens before hitting the model's output ceiling. A
max_tokens: 800 cap (more than enough for our real output) would have made
each failed call roughly 80% cheaper on output tokens and kept us under the alert
threshold for much longer.
Second: retry logic with no ceiling and a broken DLQ. SQS's maxReceiveCount
only takes effect as part of a valid redrive policy; with the redrive policy
misconfigured, there is no receive limit at all, and a message that perpetually
fails parse will perpetually retry. Infinite retry plus broken DLQ plus no per-message
retry counter means there's no circuit breaker at any layer. Either fix in isolation
would have stopped the incident.
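The output-token arithmetic behind that first point, using the incident's own numbers (input-token cost ignored for simplicity):

```typescript
// How much of each failed call's output cost a max_tokens cap would remove.
const uncappedOutputTokens = 4_200; // observed per failed call
const cappedOutputTokens = 800;     // the proposed max_tokens cap
const saving = 1 - cappedOutputTokens / uncappedOutputTokens; // ~0.81
```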
The fixes
We shipped four changes the same day:
// 1. Always cap tokens
const response = await openai.chat.completions.create({
model: 'gpt-4-turbo',
messages: [{ role: 'user', content: prompt }],
max_tokens: 800, // hard cap
stop: ['}\n', '}\r\n'], // stop on JSON object close (the API omits the stop text, so re-append '}' before parsing)
temperature: 0.2, // reduce variability for structured output
response_format: { type: 'json_object' }, // enforce JSON mode
});
// 2. Track retry count per message, dead-letter after 3 attempts
const receiveCount = parseInt(message.Attributes?.ApproximateReceiveCount || '1', 10);
if (receiveCount > 3) {
await sendToDeadLetter(message, 'max_retries_exceeded');
await deleteFromQueue(message);
return;
}
// 3. Validate DLQ ARN on startup
await validateQueueExists(process.env.DLQ_URL!);
// 4. Set spend alert at $50, not $500
// (done in AWS console + Terraform)
The response_format: { type: 'json_object' } change mattered most. OpenAI's
JSON mode guarantees syntactically valid JSON output (as long as generation isn't cut
off by max_tokens): no stray braces to confuse the extractor, no markdown fencing,
no trailing text. That alone would have prevented the whole mess. We hadn't used it
because it was a newer API feature that wasn't in the documentation we'd copied from
when we first built the pipeline. Nobody went back to check whether anything new had shipped.
What OpenAI did
We contacted OpenAI support the same day. They looked at the usage logs, confirmed the pattern was consistent with a runaway retry loop, and refunded $3,400 as a one-time goodwill credit. We were grateful, obviously, but under no illusion that it was guaranteed: their terms don't require it, the financial exposure was real, and next time we might not get the credit.
Lessons
- Always set max_tokens. Never let an LLM API call have unlimited output length in an automated pipeline. Calculate your maximum expected output and cap at 1.5x that.
- Use response_format: { type: 'json_object' } for structured output. Unstructured JSON parsing from free-form LLM output is a reliability anti-pattern.
- Verify your dead letter queue actually works by sending a test message that will always fail and confirming it reaches the DLQ. Do this after every infrastructure migration. We didn't.
- Spend alerts lag by hours. Set them at 10% of your pain threshold. A $50 alert would have fired Sunday morning when someone might still have been checking their phone.
- LLM API cost is unbounded by default. Compute costs are capped by instance size; token costs scale linearly with runaway loops. Every LLM call is a potential infinite cost if your retry logic is broken.
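The 1.5x sizing rule from the first lesson, as a tiny worked example (the sample sizes are hypothetical):

```typescript
// Measure real output sizes, then cap max_tokens at 1.5x the observed maximum.
const observedOutputTokens = [310, 420, 365, 510, 488]; // hypothetical samples
const maxObserved = Math.max(...observedOutputTokens);   // 510
const maxTokensCap = Math.ceil(maxObserved * 1.5);       // 765 -> round up to 800
```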