Our AI Documentation Bot Invented 14 API Routes That Never Existed — 6,000 Users Integrated Against Them
March 16, 2026 · AI · 11 min read


Tuesday afternoon. A developer from one of our largest enterprise customers opens a support ticket. "Your POST /v2/webhooks/replay endpoint keeps returning 404. Has it been deprecated?"

I check the route table. POST /v2/webhooks/replay has never existed. Our shiny new AI documentation assistant invented it, described it in detail with request and response examples, rate limit notes, error codes, the full treatment. At least 6,000 developers had read that page by the time we noticed.

This is what happens when you deploy an LLM without a ground-truth validation layer.

The setup: a docs bot that seemed to work perfectly

We'd built a documentation assistant for our REST API. Standard RAG setup using GPT-4 Turbo: developers ask questions in natural language, we retrieve relevant chunks from our OpenAPI spec and markdown docs, and GPT-4 writes a helpful answer on top of that context.

In testing, it was good. Genuinely good. It correctly answered questions about auth flows, pagination patterns, and webhook configuration, with accurate code examples. We ran 50 manual test cases. It passed 48. The two failures were minor phrasing nits, not factual errors. Shipped it.

  ARCHITECTURE (what we built)
  ─────────────────────────────────────────────────────────────
  
  Developer question
       │
       ▼
  Embedding model (text-embedding-3-small)
       │
       ▼
  Vector DB search → top-5 relevant doc chunks
       │
       ▼
  GPT-4 Turbo prompt:
  "Answer using ONLY the context below. If unsure, say so."
       │
       ▼
  Response rendered in docs site
  
  What we assumed: "If unsure, say so" would prevent hallucination
  What actually happened: it didn't
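
Stripped of the real services, the pipeline in the diagram can be sketched like this. This is a toy illustration only: a deterministic hash-style embedding stands in for text-embedding-3-small, and a plain Python list stands in for the vector DB. Every chunk and function name here is illustrative, not our production code.

```python
import math

# Toy corpus standing in for our OpenAPI spec + markdown doc chunks.
DOC_CHUNKS = [
    "Webhooks retry up to 3 times on failure.",
    "Webhook events have an event_id field.",
    "Authentication uses bearer tokens in the Authorization header.",
]

def embed(text: str, dims: int = 16) -> list[float]:
    """Deterministic toy embedding: character codes folded into a unit vector."""
    vec = [0.0] * dims
    for i, ch in enumerate(text.lower()):
        vec[i % dims] += ord(ch)
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def retrieve(question: str, k: int = 2) -> list[str]:
    """Top-k chunks by cosine similarity — the 'top-5' step in the diagram."""
    q = embed(question)
    ranked = sorted(DOC_CHUNKS, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(question: str) -> str:
    """Assemble the prompt that gets sent to the LLM."""
    context = "\n".join(f"- {c}" for c in retrieve(question))
    return (
        "Answer using ONLY the context below. If unsure, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

print(build_prompt("Can I replay a failed webhook?"))
```

Note what's missing: nothing between the LLM's output and the user. That gap is the whole story.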

The hallucinations were confident, detailed, and wrong

Three weeks after launch, a customer success manager pinged me. She was helping a customer and noticed the bot was describing a GET /v2/events/stream endpoint that didn't match anything in our actual API. I pulled the conversation logs.

It was worse than a few vague answers. The model had fabricated 14 separate endpoints across 847 conversation threads. Not vague suggestions. Fully specified:

  • POST /v2/webhooks/replay — "replays a failed webhook delivery" (with request body schema, retry logic description, and example response)
  • GET /v2/events/stream — "returns a Server-Sent Events stream for real-time event delivery" (with SSE format example)
  • DELETE /v2/integrations/:id/cache — "clears cached integration state" (with a 204 response description)
  • PATCH /v2/users/bulk — "batch-updates user attributes" (with pagination and rate limit notes)

Every one of these was plausible. Looking at the list now, I still think they're exactly the kind of endpoints our API should have had. The model had interpolated from patterns in our existing routes and constructed logical neighbours. This is what makes LLM hallucination dangerous in developer tools. It doesn't make things up randomly. It makes them up reasonably, in your own naming convention, in a way that would fool you too.

Why "Answer using only the context" didn't work

This is the part that genuinely surprised me. Our system prompt explicitly told the model to use only the retrieved context. I'd read the RAG playbooks, followed the recipes. But the instruction had a gap. It told the model to use the context. It did not tell it to refuse to extrapolate beyond it.

That difference matters. When a developer asked "Can I replay a failed webhook?" and our vector search returned chunks about webhook configuration and retry policies but no chunk about a replay endpoint, the model had a choice. Say "I don't know", or synthesise a plausible answer from what it did know. GPT-4 is RLHF'd into the ground to be helpful. It synthesised.

  THE FAILURE MODE
  ─────────────────────────────────────────────────────
  
  User: "Can I replay a failed webhook?"
  
  Retrieved context:
  - Chunk 1: "Webhooks retry up to 3 times on failure"
  - Chunk 2: "Webhook events have an event_id field"
  - Chunk 3: "Webhook status can be: pending, delivered, failed"
  
  No chunk: "Here is how to replay webhooks"
  
  Model reasoning (inferred):
  "Retries exist. Events have IDs. Failures are trackable.
   Logically, a replay endpoint should exist."
  
  Model output:
  "Yes! Use POST /v2/webhooks/replay with the event_id..."
  
  Reality: endpoint does not exist, never did

The blast radius

By the time we caught it, the damage was already distributed. Bot responses had been shared in Slack threads, Stack Overflow answers, and internal wikis at customer companies. I found the fabricated GET /v2/events/stream referenced in a Medium article, two GitHub repos, and a YouTube tutorial about our platform.

The support impact, once we disabled the bot: 31 tickets in the first week, all referencing hallucinated endpoints. Four enterprise customers had already built partial integrations against the fake routes. One had deployed production code that called POST /v2/webhooks/replay as a background job; it had been silently failing on every run for weeks.

We had two options. Build the endpoints the AI had promised, or tell customers the documentation they'd read was wrong. We ended up doing some of both.

The fix: ground-truth validation before every response

The core architectural change was adding a validation layer between the LLM response and the user. Every API endpoint mentioned in a bot response now gets checked against our OpenAPI spec before the response is served:

  REVISED ARCHITECTURE
  ─────────────────────────────────────────────────────────────
  
  LLM response (raw)
       │
       ▼
  Route Extractor
  (regex: [A-Z]+ /v[0-9]+/[a-z/:{}_]+ )
       │
       ▼
  OpenAPI Spec Validator
  - Check each extracted route against spec
  - Flag any route not present in spec
       │
       ├─ All routes valid → serve response as-is
       │
       └─ Unknown route found → 
            Option A: replace with disclaimer
            Option B: regenerate with stricter prompt
            Option C: surface for human review
  
  Also added to system prompt:
  "If the retrieved context does not contain a specific endpoint,
   say: 'I cannot confirm this endpoint exists. Please check the
   official API reference at [URL].'"
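
A minimal sketch of that validation layer, assuming a route regex close to the one in the diagram (with the HTTP verbs spelled out). SPEC_ROUTES is a hypothetical stand-in for the set of (verb, path) pairs parsed out of our OpenAPI spec:

```python
import re

# Matches route mentions like "POST /v2/webhooks/replay" in LLM output.
ROUTE_RE = re.compile(r"\b(GET|POST|PUT|PATCH|DELETE)\s+(/v\d+/[a-z0-9/:{}_-]+)")

# Hypothetical example routes; in reality this set is built from the OpenAPI spec.
SPEC_ROUTES = {
    ("GET", "/v2/webhooks"),
    ("POST", "/v2/webhooks"),
    ("GET", "/v2/users/{id}"),
}

def extract_routes(response: str) -> list[tuple[str, str]]:
    """Pull every 'VERB /vN/...' mention out of a raw LLM response."""
    return [(m.group(1), m.group(2)) for m in ROUTE_RE.finditer(response)]

def validate(response: str) -> tuple[bool, list[tuple[str, str]]]:
    """Return (ok, unknown_routes); ok means every mentioned route is in the spec."""
    unknown = [r for r in extract_routes(response) if r not in SPEC_ROUTES]
    return (not unknown, unknown)

ok, unknown = validate(
    "Yes! Use POST /v2/webhooks/replay with the event_id of the failed event."
)
# ok is False and unknown contains the fabricated route, so the response would
# be replaced with a disclaimer, regenerated, or routed to human review.
```

The dispatch between Options A, B, and C is left out here; the load-bearing part is that an unknown route can never reach a user unflagged.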

Retrieval changed too. Instead of retrieving by semantic similarity alone, we added a hard filter: if a question contains an HTTP verb pattern, we only synthesise an answer when a retrieved chunk explicitly contains the exact route path being asked about. No match, no answer.
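
The hard filter amounts to a few lines of gating logic before synthesis. This is an illustrative sketch — should_answer and both regexes are hypothetical names, not our production code:

```python
import re

# Uppercase verbs only, to avoid false positives on words like "get" in
# ordinary questions ("How do I get my API key?").
VERB_RE = re.compile(r"\b(GET|POST|PUT|PATCH|DELETE)\b")
PATH_RE = re.compile(r"/v\d+/[a-z0-9/:{}_-]+")

def should_answer(question: str, chunks: list[str]) -> bool:
    """Refuse to synthesise when an endpoint question has no exact-path support."""
    if not VERB_RE.search(question) and not PATH_RE.search(question):
        return True  # not an endpoint question; the normal RAG flow applies
    asked_paths = set(PATH_RE.findall(question))
    chunk_paths = {p for c in chunks for p in PATH_RE.findall(c)}
    if asked_paths:
        # Every path the user asked about must appear verbatim in some chunk.
        return asked_paths <= chunk_paths
    # Verb mentioned but no concrete path: require at least one real route in context.
    return bool(chunk_paths)
```

When should_answer returns False, the bot skips generation entirely and serves the "I don't have enough context" fallback instead.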

The prompt change that made the biggest difference was replacing:

// Before
"Answer using ONLY the context below. If unsure, say so."

// After  
"Answer ONLY what is explicitly stated in the context below.
 Do not infer, extrapolate, or suggest endpoints that are not
 literally present in the retrieved text. If the context does
 not directly answer the question, respond with exactly:
 'I don't have enough context to answer this accurately.
  Please refer to the API reference: [URL]'"

The word "explicitly" and the prohibition on inference cut hallucination rate from about 1.6% of responses to 0.03% in a post-fix evaluation over 30,000 conversations. I don't love that it's still nonzero, but it's in the range where the validation layer catches what the prompt misses.

The routes we actually had to build

For the four hallucinated endpoints that multiple enterprise customers had integrated against, we made a pragmatic call: build them. POST /v2/webhooks/replay shipped three weeks later. PATCH /v2/users/bulk was already on the roadmap; it moved up. When your AI describes a feature confidently enough, customers build against it and the hallucination quietly becomes a product commitment.

Lessons

  • "Use only the context" is not a hallucination prevention strategy. It is a preference instruction. Models will still extrapolate when context is adjacent but incomplete. You need validation, not just instructions.
  • Plausible hallucinations are more dangerous than obvious ones. A model that invents nonsense is easy to catch. A model that invents reasonable API routes in your own naming convention will fool developers for weeks.
  • Ground-truth validation has to be domain-specific. For an API docs bot, validate every route against the spec. For a code-generation tool, run the code. The LLM's output must be checkable against a source of truth you own.
  • Shared doc links spread hallucinations faster than you can patch them. Once a fabricated answer lands in a customer's Confluence, it's out of your control. Conversation log monitoring needs to happen in hours, not weeks.