Our AI Documentation Bot Invented 14 API Routes That Never Existed — 6,000 Users Integrated Against Them
Tuesday afternoon. A developer from one of our largest enterprise customers opens a
support ticket. "Your POST /v2/webhooks/replay endpoint keeps returning
404. Has it been deprecated?"
I check the route table. POST /v2/webhooks/replay has never existed. Our
shiny new AI documentation assistant invented it, described it in detail with request
and response examples, rate limit notes, error codes, the full treatment. At least
6,000 developers had read that page by the time we noticed.
This is what happens when you deploy an LLM without a ground-truth validation layer.
The setup: a docs bot that seemed to work perfectly
We'd built a documentation assistant for our REST API. Standard RAG setup using GPT-4 Turbo: developers ask questions in natural language, we retrieve relevant chunks from our OpenAPI spec and markdown docs, and GPT-4 writes a helpful answer on top of that context.
In testing, it was good. Genuinely good. It answered questions about auth flows, pagination patterns, and webhook configurations correctly, with accurate code examples. We ran 50 manual test cases. It passed 48. The two failures were minor phrasing nits, not factual errors. Shipped it.
ARCHITECTURE (what we built)
─────────────────────────────────────────────────────────────
Developer question
│
▼
Embedding model (text-embedding-3-small)
│
▼
Vector DB search → top-5 relevant doc chunks
│
▼
GPT-4 Turbo prompt:
"Answer using ONLY the context below. If unsure, say so."
│
▼
Response rendered in docs site
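Condensed to code, the synthesis step looked roughly like this. A minimal sketch: `build_prompt` and the chunk-numbering format are illustrative, not our exact template, but the instruction line is verbatim what we shipped.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble the grounded prompt from the top-5 retrieved chunks.

    Sketch only: the production template had more boilerplate around
    the same instruction.
    """
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, 1))
    return (
        "Answer using ONLY the context below. If unsure, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

The assembled prompt went straight to GPT-4 Turbo, and the model's reply went straight to the docs site. Nothing in between checked the reply against the spec.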
What we assumed: "If unsure, say so" would prevent hallucination
What actually happened: it didn't
The hallucinations were confident, detailed, and wrong
Three weeks after launch, a customer success manager pinged me. She was helping a
customer and noticed the bot was describing a GET /v2/events/stream endpoint
that didn't match anything in our actual API. I pulled the conversation logs.
It was worse than a few vague answers. The model had fabricated 14 separate endpoints across 847 conversation threads. Not vague suggestions. Fully specified:
- POST /v2/webhooks/replay — "replays a failed webhook delivery" (with request body schema, retry logic description, and example response)
- GET /v2/events/stream — "returns a Server-Sent Events stream for real-time event delivery" (with SSE format example)
- DELETE /v2/integrations/:id/cache — "clears cached integration state" (with a 204 response description)
- PATCH /v2/users/bulk — "batch-updates user attributes" (with pagination and rate limit notes)
Every one of these was plausible. Looking at the list now, I still think they're exactly the kind of endpoints our API should have had. The model had interpolated from patterns in our existing routes and constructed logical neighbours. This is what makes LLM hallucination dangerous in developer tools. It doesn't make things up randomly. It makes them up reasonably, in your own naming convention, in a way that would fool you too.
Why "Answer using only the context" didn't work
This is the part that genuinely surprised me. Our system prompt explicitly told the model to use only the retrieved context. I'd read the RAG playbooks, followed the recipes. But the instruction had a gap. It told the model to use the context. It did not tell it to refuse to extrapolate beyond it.
That difference matters. When a developer asked "Can I replay a failed webhook?" and our vector search returned chunks about webhook configuration and retry policies but no chunk about a replay endpoint, the model had a choice. Say "I don't know", or synthesise a plausible answer from what it did know. GPT-4 is RLHF'd into the ground to be helpful. It synthesised.
THE FAILURE MODE
─────────────────────────────────────────────────────────────
User: "Can I replay a failed webhook?"

Retrieved context:
- Chunk 1: "Webhooks retry up to 3 times on failure"
- Chunk 2: "Webhook events have an event_id field"
- Chunk 3: "Webhook status can be: pending, delivered, failed"

No chunk: "Here is how to replay webhooks"

Model reasoning (inferred):
"Retries exist. Events have IDs. Failures are trackable.
Logically, a replay endpoint should exist."

Model output:
"Yes! Use POST /v2/webhooks/replay with the event_id..."

Reality: endpoint does not exist, never did
The blast radius
By the time we caught it, the damage was already distributed. Bot responses had been
shared in Slack threads, Stack Overflow answers, and internal wikis at customer
companies. I found the fabricated GET /v2/events/stream referenced in a
Medium article, two GitHub repos, and a YouTube tutorial about our platform.
The support impact, once we disabled the bot: 31 tickets in the first week, all
referencing hallucinated endpoints. Four enterprise customers had already built partial
integrations against the fake routes. One had deployed production code that called
POST /v2/webhooks/replay as a background job, silently failing on every run
for weeks.
We had two options. Build the endpoints the AI had promised, or tell customers the documentation they'd read was wrong. We ended up doing some of both.
The fix: ground-truth validation before every response
The core architectural change was adding a validation layer between the LLM response and the user. Every API endpoint mentioned in a bot response now gets checked against our OpenAPI spec before the response is served:
REVISED ARCHITECTURE
─────────────────────────────────────────────────────────────
LLM response (raw)
│
▼
Route Extractor
(regex: [A-Z]+ /v[0-9]+/[a-z/:{}_]+ )
│
▼
OpenAPI Spec Validator
- Check each extracted route against spec
- Flag any route not present in spec
│
├─ All routes valid → serve response as-is
│
└─ Unknown route found →
Option A: replace with disclaimer
Option B: regenerate with stricter prompt
Option C: surface for human review
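The extractor and validator boxes above can be sketched in a few lines. Assumptions: `spec` is the parsed OpenAPI document, `:id`-style paths get normalised to the spec's `{id}` placeholders, and the route regex here is a simplified version of the one in the diagram (dots are excluded from the path class so sentence punctuation isn't swallowed).

```python
import re

# Endpoint mentions in a response: HTTP verb + versioned path.
ROUTE_RE = re.compile(r"\b(GET|POST|PUT|PATCH|DELETE)\s+(/v\d+/[\w/:{}-]+)")

def extract_routes(text):
    """Return (verb, path) pairs mentioned in an LLM response."""
    return ROUTE_RE.findall(text)

def validate_against_spec(routes, spec):
    """Flag routes absent from the OpenAPI spec.

    Builds the set of known (VERB, path) pairs from spec["paths"],
    then checks each extracted route against it. A real OpenAPI
    paths object can also hold non-operation keys like "parameters";
    this sketch assumes only verbs.
    """
    known = {
        (verb.upper(), path)
        for path, ops in spec.get("paths", {}).items()
        for verb in ops
    }
    unknown = []
    for verb, path in routes:
        canonical = re.sub(r":(\w+)", r"{\1}", path)  # :id -> {id}
        if (verb, canonical) not in known:
            unknown.append((verb, path))
    return unknown
```

Anything `validate_against_spec` returns is, by construction, a route the model invented, and it gets routed to one of the three options above instead of being served.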
Also added to system prompt:
"If the retrieved context does not contain a specific endpoint,
say: 'I cannot confirm this endpoint exists. Please check the
official API reference at [URL].'"
Retrieval changed too. Instead of retrieving by semantic similarity alone, we added a hard filter: if a question contains an HTTP verb pattern, we only synthesise an answer if a retrieved chunk explicitly contains that exact route path. No match, no answer.
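A sketch of that gate, simplified to the case where the question spells out a full VERB /path route (the function name is hypothetical):

```python
import re

# Explicit route mention in a question: verb + versioned path.
ROUTE_IN_QUESTION = re.compile(
    r"\b(?:GET|POST|PUT|PATCH|DELETE)\s+(/v\d+/[\w/:{}-]+)", re.IGNORECASE
)

def allow_synthesis(question: str, chunks: list[str]) -> bool:
    """Hard retrieval gate (sketch): if the question names an explicit
    route, only let the model answer when every named path appears
    verbatim in some retrieved chunk. No match, no answer.
    """
    paths = ROUTE_IN_QUESTION.findall(question)
    if not paths:
        return True  # no explicit route named; normal RAG flow
    return all(any(path in chunk for chunk in chunks) for path in paths)
```

The point of the hard filter is that it fails closed: semantic similarity alone would happily retrieve webhook-adjacent chunks for a question about a route that doesn't exist.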
The prompt change that made the biggest difference was replacing:
// Before
"Answer using ONLY the context below. If unsure, say so."
// After
"Answer ONLY what is explicitly stated in the context below.
Do not infer, extrapolate, or suggest endpoints that are not
literally present in the retrieved text. If the context does
not directly answer the question, respond with exactly:
'I don't have enough context to answer this accurately.
Please refer to the API reference: [URL]'"
The word "explicitly" and the prohibition on inference cut hallucination rate from about 1.6% of responses to 0.03% in a post-fix evaluation over 30,000 conversations. I don't love that it's still nonzero, but it's in the range where the validation layer catches what the prompt misses.
The routes we actually had to build
For the four hallucinated endpoints that multiple enterprise customers had integrated
against, we made a pragmatic call: build them. POST /v2/webhooks/replay
shipped three weeks later. PATCH /v2/users/bulk was already on the
roadmap; it moved up. When your AI describes a feature confidently enough, customers
build against it and the hallucination quietly becomes a product commitment.
Lessons
- "Use only the context" is not a hallucination prevention strategy. It is a preference instruction. Models will still extrapolate when context is adjacent but incomplete. You need validation, not just instructions.
- Plausible hallucinations are more dangerous than obvious ones. A model that invents nonsense is easy to catch. A model that invents reasonable API routes in your own naming convention will fool developers for weeks.
- Ground-truth validation has to be domain-specific. For an API docs bot, validate every route against the spec. For a code-generation tool, run the code. The LLM's output must be checkable against a source of truth you own.
- Shared doc links spread hallucinations faster than you can patch them. Once a fabricated answer lands in a customer's Confluence, it's out of your control. Conversation log monitoring needs to happen in hours, not weeks.