Anthropic says prompt caching can cut your Claude API bill “up to 90%.” Every blog post repeats the number. Meanwhile your monthly invoice keeps climbing.
Two questions are worth answering before you touch a line of code: when does that 90% actually happen, and how do you know if your workload qualifies? Most articles skip both. They show you the syntax and wave at the savings.
This guide gives you the formula, the code with cost-tracking instrumentation, and the honest list of cases where prompt caching for the Claude API is a waste of effort.
What Prompt Caching Actually Does
The first time Claude sees a prompt prefix, you pay a 25% premium on those tokens — that’s the “cache write,” billed at 1.25x normal input price. For the next five minutes, any request with the identical prefix gets billed at 10% of normal input. That’s the “cache read.” Same model, same output, dramatically different bill.
You opt in by adding a cache_control marker to the parts of your prompt that don’t change between calls — system prompts, tool definitions, large reference documents, few-shot examples. Dynamic stuff (the user message, fresh data) goes after the marker.
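To make the layout concrete, here is a minimal sketch using the standard anthropic Python SDK. SEARCH_TOOL, STYLE_GUIDE, and REFERENCE_DOC are placeholder constants, not anything from the API; the point is that the static blocks come first, the cache_control marker sits on the last static block, and the cacheable prefix runs from the top of the request down to that marker.

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=[SEARCH_TOOL],  # tool definitions are stable between calls and sit in the prefix
    system=[
        {"type": "text", "text": STYLE_GUIDE},  # stable: voice and format rules
        {
            "type": "text",
            "text": REFERENCE_DOC,  # stable: large reference document
            # the marker: everything up to and including this block is the cacheable prefix
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[
        {"role": "user", "content": user_query},  # dynamic: changes every call, never cached
    ],
)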
The five-minute TTL is the constraint everything else hinges on. Three reads in that window and the math swings firmly your way. A write that never gets read, and you paid a 25% premium for the privilege of caching nothing.
10% of normal input sounds great. The question is what that means in dollars for the way you actually call the API.
The ROI Formula: When Caching Actually Pays Off
Run this on a napkin before you implement anything:
monthly_savings = (cached_tokens × reads × normal_price × 0.9) − (cached_tokens × writes × normal_price × 0.25)
Plain English: every cache read returns 90% of the normal input cost as savings, and every cache write costs you 25% extra upfront. Break-even is low: a single read per write already puts you ahead, since 0.9 in savings beats the 0.25 premium. The headline numbers need volume, though. At three reads per write inside the five-minute window you are saving roughly 60% of the input bill for those tokens, and it climbs toward 90% as the read count grows. A write that never gets read is pure premium.
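The same formula as a few lines of Python you can run on real numbers; the 0.9 and 0.25 multipliers come straight from the read discount and write premium described above.

def caching_savings(cached_tokens, reads, writes, price_per_token):
    """Net input-side savings in dollars for a given volume of cache traffic.

    reads and writes are request counts; price_per_token is the model's
    normal input price, e.g. 3 / 1_000_000 for Sonnet.
    """
    saved_on_reads = cached_tokens * reads * price_per_token * 0.9
    paid_on_writes = cached_tokens * writes * price_per_token * 0.25
    return saved_on_reads - paid_on_writes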
A worked example. I consult with clients using a 4,000-token system prompt — my diagnostic framework, output format, voice rules — across 50 client conversations a day. Each conversation averages five back-and-forth calls. That's 250 reads per day, mostly clustered inside their own five-minute windows. At Sonnet pricing ($3 per million input tokens), each read saves about a penny (4,000 × 3/1,000,000 × 0.9 ≈ $0.011), so the input-side savings land around $2.50 a day after the write premiums. Call it $75 a month for a one-line code change, and it scales linearly with the size of the prefix you cache.
Now the counter-example. A nightly batch script that calls Claude once with a 10K context document. One write, no reads, nothing to amortize. Caching makes that workload more expensive, not less. Skip it.
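Plugging both scenarios into caching_savings (assuming Sonnet's $3-per-million input price and roughly one cache write per conversation) makes the gap obvious.

SONNET_INPUT = 3 / 1_000_000

# Consulting assistant: 4,000-token prefix, 250 reads and ~50 writes per day
consulting = caching_savings(4_000, reads=250, writes=50, price_per_token=SONNET_INPUT)
print(f"consulting: {consulting:.2f} $/day, ~{consulting * 30:.0f} $/month")  # ≈ 2.55 $/day

# Nightly batch: 10,000-token document, one write, zero reads
nightly = caching_savings(10_000, reads=0, writes=1, price_per_token=SONNET_INPUT)
print(f"nightly batch: {nightly:.4f} $/run")  # negative: caching makes this run cost more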
The formula tells you which side of the line you’re on before you ship a thing. The next question is how to confirm the math is actually working in production.
The Code: Implement Caching and Track Every Dollar
Caching itself is one line. The interesting work is the instrumentation.
import anthropic

client = anthropic.Anthropic()

NORMAL_PRICE = 3 / 1_000_000  # Sonnet input, per token

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LARGE_SYSTEM_PROMPT,  # the static prefix you want cached
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": user_query}],  # dynamic, after the marker
)

u = response.usage
write = u.cache_creation_input_tokens or 0  # tokens written to cache (billed at 1.25x)
read = u.cache_read_input_tokens or 0       # tokens served from cache (billed at 0.1x)
miss = u.input_tokens                       # uncached input tokens, billed at full price
hit_rate = read / (read + miss) if (read + miss) else 0
saved = (read * NORMAL_PRICE * 0.9) - (write * NORMAL_PRICE * 0.25)
print(f"hit_rate={hit_rate:.0%} saved=${saved:.4f}")
Two fields are the only ground truth that caching is working: cache_creation_input_tokens and cache_read_input_tokens. If cache_read_input_tokens stays at zero across more than a couple of calls, the cache isn’t hitting — and the rest of the diagnosis happens in the next section.
One footnote that ruins implementations silently: the minimum cacheable prefix is 1,024 tokens for Sonnet and Opus, 2,048 for Haiku. Sub-threshold prompts never cache, never error. They just pay full price forever while you wonder why your savings dashboard is empty.
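A cheap guard is to count the prefix tokens up front with the SDK's token-counting endpoint and refuse to trust the savings math until the prefix clears the bar. The check below assumes Sonnet or Opus (swap in 2,048 for Haiku) and reuses the client and LARGE_SYSTEM_PROMPT from the code above; the count also includes the small placeholder user message the endpoint requires, so treat it as approximate.

count = client.messages.count_tokens(
    model="claude-sonnet-4-6",
    system=[{"type": "text", "text": LARGE_SYSTEM_PROMPT}],
    messages=[{"role": "user", "content": "placeholder"}],  # endpoint requires a message
)
if count.input_tokens < 1_024:  # 2_048 for Haiku
    print(f"prefix is ~{count.input_tokens} tokens, under the cache minimum; "
          "every call will be billed at full price")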
The instrumentation is what separates “I turned caching on” from “I know it saved me $42 yesterday.” Most cache deployments fail at the second part — and not because the feature is broken.
The 5 Mistakes That Silently Kill Your Cache Hit Rate
If your cache_read_input_tokens is at zero, one of these is the cause. Always.
- A timestamp or request ID in the cached prefix. “Today is 2026-04-30” inside a system prompt invalidates the cache every midnight. One character difference equals zero hits.
- Dynamic content before the cache_control marker. User name, session ID, any per-request value that lives inside the cached section instead of after it. Caching needs an exact-byte prefix match, not “mostly the same.”
- Whitespace and ordering drift. Reformat the system prompt for readability between deploys, swap two instruction blocks, change tabs to spaces — every existing cache key dies.
- Switching tool definitions. Tools are part of the cached prefix. Adding, removing, or reordering one tool invalidates everything cached behind it.
- Parallel calls in the first five seconds. The first call writes the cache. Concurrent calls miss because the write hasn’t landed yet. Send the first request, wait for the response, then fan out.
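The fix for that last one is a warm-up call: pay the cache write once, then fan out. A minimal sketch with the SDK's async client, where LARGE_SYSTEM_PROMPT and QUESTIONS are placeholders:

import asyncio
from anthropic import AsyncAnthropic

aclient = AsyncAnthropic()

async def ask(question: str):
    return await aclient.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": LARGE_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": question}],
    )

async def run_all(questions):
    first = await ask(questions[0])  # writes the cache; pay the 1.25x once
    rest = await asyncio.gather(*(ask(q) for q in questions[1:]))  # concurrent cache reads
    return [first, *rest]

results = asyncio.run(run_all(QUESTIONS))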
Audit checklist: log cache_read_input_tokens for 50 consecutive calls. If your hit rate is under 60%, walk this list before changing anything else.
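A running tally makes that audit trivial. Wrap each response's usage in something like this (the record helper is just an illustration, not an SDK feature):

totals = {"read": 0, "write": 0, "miss": 0}

def record(usage):
    totals["read"] += usage.cache_read_input_tokens or 0
    totals["write"] += usage.cache_creation_input_tokens or 0
    totals["miss"] += usage.input_tokens

# call record(response.usage) after each of the 50 requests, then:
denom = totals["read"] + totals["miss"]
hit_rate = totals["read"] / denom if denom else 0
print(f"hit rate over the sample: {hit_rate:.0%} "
      f"(read={totals['read']} write={totals['write']} miss={totals['miss']})")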
Even with all five fixed, there are workloads where caching is the wrong tool entirely.
When Not to Cache (And When to Stack It With the Batch API)
Skip caching when calls are spaced more than five minutes apart with no traffic in between — every call becomes a fresh write. Skip when every prompt is genuinely unique, when the prefix is under the minimum threshold, or when you’re running a single nightly job. The 25% write premium will eat any savings.
When it does fit, stack it with the Batch API. Batch is 50% off list price across the board. Caching cuts the input portion further. For workloads where you’re processing many requests against a shared knowledge base — same prefix, different rows — you can land 85-90% total reduction on the input side.
The pattern: cache the knowledge base, queue the per-row prompts as a Batch job, run it overnight. Same outputs. A fraction of the bill.
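A sketch of that pattern with the SDK's batch endpoint, where KNOWLEDGE_BASE and rows are placeholders. One caveat: batch requests run asynchronously, so cache hits inside a batch are best-effort rather than guaranteed.

cached_system = [{
    "type": "text",
    "text": KNOWLEDGE_BASE,  # shared prefix for every row
    "cache_control": {"type": "ephemeral"},
}]

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"row-{i}",
            "params": {
                "model": "claude-sonnet-4-6",
                "max_tokens": 1024,
                "system": cached_system,
                "messages": [{"role": "user", "content": row}],
            },
        }
        for i, row in enumerate(rows)
    ]
)
print(batch.id, batch.processing_status)  # poll later, then fetch results when it ends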
That’s the full picture. The only thing left is whether to flip it on this week.
The Bottom Line
If your prefix is over 1,024 tokens AND you average three reads per five-minute window, turn caching on this afternoon. The change is one line. The savings show up in the next response.
If you miss either condition, leave it off and revisit when traffic grows. The 25% write premium will eat any single-call workload alive.
The 90% number is real. It’s just not yours by default — it’s yours after you measure your own hit rate, fix the five mistakes, and stop trusting the marketing version. If you’re still wiring up the basics, the Claude API tutorial in Python gets you to your first call in 15 minutes — then come back here when you’re ready to optimize the bill.