The Claude API feature that cut my AI costs by 80% overnight
March 31, 2026 · Claude · 7 min read


I was building an internal code review tool on top of Claude when our monthly Anthropic bill crossed a threshold that made my engineering manager raise an eyebrow. We were sending the same 50,000-token system prompt on every API call: our coding standards doc, architecture guide, and team conventions. I suspected this was wasteful and assumed that's just how LLM APIs worked.

Then I stumbled into two words in the Anthropic docs: cache_control.


What most engineers do

When you build with the Claude API, the natural pattern is: every call gets the full context. System prompt, few-shot examples, relevant documentation, then the actual user question at the end.

This works. The results are good. The problem is that real production system prompts get large, and you're paying full price on every token on every call, even when 95% of the content hasn't changed between requests.

For a code review tool that runs on every PR, it adds up fast. Our system prompt was around 50k tokens. At Claude Sonnet's input pricing of $3 per million tokens, that's roughly $0.15 per review in input tokens alone. At 200 PRs per week, that works out to about $30 a week in context overhead before any actual reasoning happens.

What I found

Anthropic's prompt caching lets you mark specific blocks of your prompt as cacheable. The first time you send a request, Claude stores those blocks server-side. On subsequent calls, if the cached content matches, Claude pulls it from cache instead of reprocessing it, at roughly 10% of the original token cost.

The API is almost embarrassingly simple. You add a cache_control key to any content block:

typescript
const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-5',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: yourLargeCodingStandardsDoc, // ~50,000 tokens
      cache_control: { type: 'ephemeral' }, // 👈 mark as cacheable
    },
  ],
  messages: [
    {
      role: 'user',
      content: 'Review this diff for issues:\n\n' + prDiff,
    },
  ],
});

The usage object in the response tells you exactly what happened:

typescript
// First call — cache write
{
  input_tokens: 53,                   // the PR diff
  cache_creation_input_tokens: 50234, // stored to cache
  cache_read_input_tokens: 0
}

// Subsequent calls — cache hit!
{
  input_tokens: 61,                   // new PR diff
  cache_creation_input_tokens: 0,
  cache_read_input_tokens: 50234      // retrieved from cache 🎉
}

Why it's better

The numbers are striking. Here's what changed for our code review tool after enabling caching:

WITHOUT CACHING
┌─────────────────────────────────────────────┐
│  Every API call                             │
│  ┌──────────────────────────────────────┐   │
│  │  System prompt (50k tokens)  $0.150  │   │
│  │  + PR diff    (1k tokens)    $0.003  │   │
│  └──────────────────────────────────────┘   │
│  Total per review:        ~$0.153           │
│  200 reviews/week:        ~$30.60           │
└─────────────────────────────────────────────┘

WITH PROMPT CACHING
┌─────────────────────────────────────────────┐
│  First call (cache write, +25% cost)        │
│  ┌──────────────────────────────────────┐   │
│  │  System prompt (50k @ 125%)  $0.188  │   │
│  │  + PR diff    (1k tokens)    $0.003  │   │
│  └──────────────────────────────────────┘   │
│                                             │
│  All subsequent calls (cache read, -90%)    │
│  ┌──────────────────────────────────────┐   │
│  │  System prompt from cache    $0.015  │   │
│  │  + PR diff    (1k tokens)    $0.003  │   │
│  └──────────────────────────────────────┘   │
│  199 remaining reviews:   ~$3.58            │
│                                             │
│  Weekly total:   $3.77   (was $30.60)       │
│  Savings:        88% reduction  ✅          │
└─────────────────────────────────────────────┘

There's a latency benefit too. On a cache hit, the model skips reprocessing the cached prefix, so time-to-first-token drops on large-context calls. For our CI tool, review latency went from around 8 seconds to under 3 on cache hits.

A few details worth knowing:

  • Cached content persists for at least 5 minutes and is refreshed every time it's used. For frequently-called tools, it effectively stays alive indefinitely.
  • The first call costs 25% more because you're paying to store the cache. Every hit after that is 90% cheaper. Break-even is at 2 calls.
  • You can mark up to 4 cache breakpoints in a single prompt, which is handy for layering a system prompt, few-shot examples, and a big reference doc.
  • The content has to be identical. A single character change invalidates the cache for that block, so design your static sections to be truly static.
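Those savings fall out of simple arithmetic. Here's a minimal sketch of the cost model behind the diagram above, assuming Sonnet-class input pricing of $3 per million tokens with the 1.25× cache-write premium and 0.1× cache-read rate from the docs:

```typescript
// Assumed pricing: $3 per million input tokens; cache writes cost 1.25x,
// cache reads 0.1x.
const PRICE_PER_TOKEN = 3 / 1_000_000;

function weeklyCost(
  staticTokens: number, // the cached system prompt
  perCallTokens: number, // the per-request diff
  calls: number,
  cached: boolean
): number {
  if (!cached) {
    return calls * (staticTokens + perCallTokens) * PRICE_PER_TOKEN;
  }
  // First call writes the cache at a 25% premium...
  const firstCall = (staticTokens * 1.25 + perCallTokens) * PRICE_PER_TOKEN;
  // ...every later call reads it back at 10% of the normal rate.
  const rest =
    (calls - 1) * (staticTokens * 0.1 + perCallTokens) * PRICE_PER_TOKEN;
  return firstCall + rest;
}

console.log(weeklyCost(50_000, 1_000, 200, false).toFixed(2)); // ≈ 30.60
console.log(weeklyCost(50_000, 1_000, 200, true).toFixed(2));  // ≈ 3.77
```

Plugging in two calls instead of 200 confirms the break-even point: 1.25 + 0.1 = 1.35× the static prompt's cost, versus 2× without caching.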

Real use cases

CI/CD code review. Coding standards, architecture conventions, example patterns, all cached across every PR. Only the diff changes per call. This is exactly the pattern that dropped our bill by 88%.

Documentation chatbots. Cache the entire docs corpus in the system prompt. Users ask questions; you only pay for the question tokens on cache hits. If your docs are 100k tokens, that's 90% savings on every query after the first.

Multi-turn conversations with large context. If your chat history references a large document loaded at the start of the conversation, cache it. Subsequent turns only pay for new messages.

Agentic pipelines. When an agent loops over many items (analyzing 200 log entries with the same analysis instructions, say), cache the instructions block. Per-item cost drops to roughly 10%.
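For layered contexts like these, the multiple-breakpoint detail from earlier pays off: cache the stable standards doc and the more frequently edited few-shot examples as separate blocks, so changing one doesn't invalidate the other. A sketch — the `buildSystemBlocks` helper is my own naming, not part of the SDK:

```typescript
// Each block carrying cache_control becomes its own cache breakpoint
// (up to 4 per request). Editing the examples invalidates only the
// second block; the standards doc stays cached.
type SystemBlock = {
  type: 'text';
  text: string;
  cache_control?: { type: 'ephemeral' };
};

function buildSystemBlocks(
  standardsDoc: string, // rarely changes
  fewShotExamples: string // tweaked more often
): SystemBlock[] {
  return [
    { type: 'text', text: standardsDoc, cache_control: { type: 'ephemeral' } },
    { type: 'text', text: fewShotExamples, cache_control: { type: 'ephemeral' } },
  ];
}
```

Pass the result as the `system` array in `anthropic.messages.create`, exactly as in the earlier example.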

Try it

Here's a drop-in helper that wraps your Claude calls with caching for any static content block:

typescript
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();

async function callClaudeWithCaching({
  staticContext,  // your large, stable content
  userMessage,    // what changes each call
  model = 'claude-sonnet-4-5',
  maxTokens = 1024,
}: {
  staticContext: string;
  userMessage: string;
  model?: string;
  maxTokens?: number;
}) {
  const response = await anthropic.messages.create({
    model,
    max_tokens: maxTokens,
    system: [
      {
        type: 'text' as const,
        text: staticContext,
        cache_control: { type: 'ephemeral' as const },
      },
    ],
    messages: [
      {
        role: 'user' as const,
        content: userMessage,
      },
    ],
  });

  const { usage } = response;
  const cacheHit = (usage.cache_read_input_tokens ?? 0) > 0;

  console.log(
    cacheHit
      ? 'Cache hit — saved ' + usage.cache_read_input_tokens + ' tokens'
      : 'Cache miss — stored ' + usage.cache_creation_input_tokens + ' tokens'
  );

  return response;
}

// Usage
const result = await callClaudeWithCaching({
  staticContext: yourCodingStandardsDoc, // stays in cache between calls
  userMessage: 'Review this diff:\n\n' + prDiff,
});

First run you'll see "Cache miss, stored X tokens". Every subsequent call with the same staticContext logs "Cache hit", and you'll be paying about 10 cents on the dollar for those tokens.

One thing I wish I'd known earlier: prompt caching works on Claude Haiku, Sonnet, and Opus. No feature flag, no support ticket. You add the cache_control key and it works. (I genuinely felt a bit stupid for not using it sooner.)

If you're building anything that calls Claude repeatedly with the same large context, this is the highest-ROI change you can make to your API usage today. Break-even is literally two calls.
