Anthropic: explicit caching via cache_control breakpoints. Write costs 1.25× input (5-min TTL) or 2.0× (1-hour TTL). Reads cost 0.10× input — a 90% discount. Minimum 1,024 tokens.
OpenAI: automatic caching, server-side, no code changes. Cache reads cost 0.5× input — a 50% discount. Minimum 1,024 tokens, growing in 128-token increments.
The mental model: caching pays for itself when the same long prefix is reused across many requests within the TTL window. Chatbots, agents, RAG, document Q&A — yes. One-shot transforms with unique inputs — no.
- What prompt caching actually does
- Anthropic vs OpenAI — the philosophical split
- The breakeven math (when caching pays off)
- The prompt structure rule (and the mistake that breaks caches)
- Four patterns where caching is a clear win
- Three patterns where caching doesn't pay off
- Implementing it on Anthropic (code)
- Implementing it on OpenAI (no code)
- How to verify your cache is actually hitting
What Prompt Caching Actually Does
When a large language model serves a request, the most expensive thing it does — both in latency and in cost — is process the input tokens into the KV-cache that the attention layers operate on. For most real-world LLM applications, the same large chunks of input (system prompt, tool definitions, retrieved documents, conversation history) get reprocessed on every single request. That is wasted compute.
Prompt caching stores the processed state of repeated content so the provider can skip the reprocessing step. You pay a small premium on the first request to warm the cache, then a deep discount on every subsequent request that hits the same prefix. For workloads with stable reusable prefixes, this is the single largest cost lever available on either provider in 2026.
Anthropic vs OpenAI — The Philosophical Split
The two providers landed on opposite designs for the same problem. Both are defensible. The right choice depends on your workload.
| Anthropic | OpenAI | |
|---|---|---|
| How you enable it | Explicit. Mark blocks with cache_control: {"type": "ephemeral"} | Automatic. Zero code changes |
| Min prompt length | 1,024 tokens | 1,024 tokens |
| Cache write cost | 1.25× input (5-min TTL) or 2.0× (1-hour TTL) | Free |
| Cache read cost | 0.10× input (90% discount) | 0.5× input (50% discount) |
| TTL | 5 minutes or 1 hour (refreshes on hit) | Provider-managed (typically minutes; not configurable) |
| Cache granularity | Up to 4 explicit breakpoints — you choose what's cached | Longest matching prefix, in 128-token increments |
| Best for | High-volume workloads with stable, large prefixes; sustained traffic that keeps the cache warm | Anything with prompts over 1,024 tokens, zero engineering effort, sporadic traffic |
Anthropic's design assumes you know your workload and want maximum savings on it. OpenAI's design assumes you'd rather not think about it. If your traffic is high-frequency and your prompts are large and stable, Anthropic's 90% discount will dominate the comparison; if your traffic is sporadic or your prompts evolve frequently, OpenAI's automatic 50% with no write premium is the cleaner choice.
The Breakeven Math (When Caching Pays Off)
For Anthropic, the question is: does the discount on cache reads outweigh the premium on the first write?
With the 5-minute TTL (1.25× write, 0.10× read), you save the write premium back after one cache hit. Hit once, you've already broken even. Every hit after that is pure savings.
With the 1-hour TTL (2.0× write, 0.10× read), you break even after two cache hits. Use the 1-hour TTL when your requests are spaced widely enough that the 5-minute cache would expire between them, but frequently enough to hit at least twice within the hour.
Concretely: if you have a 50,000-token system-prompt-plus-tools-plus-documents block that's reused across an agent's session, caching turns the cost-per-request from "50,000 tokens at full input price" into "50,000 tokens at 10% of input price" for every request after the first. For agents that hit the model dozens of times per session, the cumulative savings on the long prefix are dramatic — often well over 80% of total spend on that workload.
For OpenAI, the math is simpler: cache writes are free, so any request over 1,024 tokens that re-uses a prefix is strictly better off. There's no breakeven calculation — just a 50% discount on the cached portion whenever you hit.
The Prompt Structure Rule (And the Mistake That Breaks Caches)
Caches always match the longest stable prefix of your prompt. The single rule that determines whether you get cache hits is:
Put stable content at the start of your prompt. Per-request content goes at the end. Anything that changes per request invalidates the cache for everything that comes after it.
The canonical order, top to bottom:
- System prompt — identical across all calls
- Tool definitions — identical across all calls
- Large reference documents — identical for the session/conversation
- Conversation history — grows but is append-only (each turn becomes part of the cacheable prefix on the next turn)
- User's current message — changes per request, goes last
The most common mistake is injecting per-request context at the top of the system prompt. "Today's date is 2026-06-18. The user's name is Jamie. The current page is /dashboard." Each of those varies per request and per user, and putting them at the top means the cache match fails on the very first character — you cache nothing.
The fix is to move per-request context to the end of the user message, or into a separate appended block. The 50,000-token system prompt and tool definitions above it stay stable across all users and all requests, and the cache hits.
Four Patterns Where Caching Is a Clear Win
1. Conversational chatbots with long system prompts
Any chatbot or assistant with a multi-thousand-token system prompt (instructions, tone, format rules, examples) is a textbook caching workload. The system prompt is identical across every user and every turn. The conversation history grows but is append-only. The new user message is small. Caching turns a steady-state cost into a tiny fraction of the un-cached cost.
2. Agentic loops with tool definitions
Agents that loop — reasoning, tool call, observation, repeat — reprocess the same tool definitions and system prompt on every iteration. A 10-iteration agent loop without caching is roughly 10× the input cost of the underlying prefix. With caching, it's roughly 1× the prefix cost plus 10× the small incremental tokens that change per iteration. This is where the savings compound most aggressively.
3. RAG (Retrieval-Augmented Generation) with stable corpora
RAG patterns where the same document or set of documents is queried multiple times benefit enormously from caching. Cache the retrieved documents on the first query; subsequent queries against the same documents within the TTL pay only the small read cost.
Important: this only works when the retrieved set is stable across queries. If your RAG fetches different documents per query, the cache misses every time. Pin the retrieval to common docs (e.g., the same product manual being queried by different users) and the cache hits beautifully.
4. Document Q&A and code-aware assistants
Any pattern where a large document or codebase context is loaded into the prompt, then multiple questions are asked against it, is ideal. Cache the document on the first question; the subsequent questions read from cache. Code editors that load the open file plus dependencies, then handle multiple completions or chat turns against that context, see the largest savings.
Three Patterns Where Caching Doesn't Pay Off
- Short prompts. Anything under 1,024 tokens gets no caching at all on either provider. If your average request is short, optimize elsewhere.
- Per-request unique prefixes. If your system prompt or document set changes every call — e.g., personalized prompts that include the user's full profile inline at the top — there's nothing to cache. Restructure the prompt (per-request data goes last) or accept that this workload isn't a caching fit.
- Very low-frequency workloads on Anthropic. If a request hits, then the next request comes more than 5 minutes later (or more than 1 hour for the long TTL), the cache has expired and you've paid the write premium for nothing. OpenAI's automatic caching avoids this because writes are free; on Anthropic, you're best off using the automatic 50% discount on OpenAI or sticking with un-cached calls if traffic is genuinely sporadic.
Implementing It on Anthropic
You mark up to four cache breakpoints in your request. Anything from the start of the prompt up to and including a breakpoint becomes cacheable. The typical pattern: one breakpoint at the end of the system prompt + tools block, and (for long conversations) one at the end of the prior conversation history.
For the 1-hour TTL (instead of the default 5-minute), pass "cache_control": {"type": "ephemeral", "ttl": "1h"}. The response object includes usage.cache_creation_input_tokens and usage.cache_read_input_tokens so you can verify whether the cache wrote, hit, or both.
Implementing It on OpenAI
There's nothing to implement. Caching is automatic on supported models for any prompt over 1,024 tokens. The only thing you need to do is structure your prompts correctly (stable content first, per-request content last) so the longest matching prefix is large enough to be worth caching.
The response includes a usage.prompt_tokens_details.cached_tokens field that tells you how many of the input tokens hit the cache on that call. If your prompts are well-structured, you should see this number climb after the first warm-up call and stay high.
How to Verify Your Cache Is Actually Hitting
The most common failure mode for caching is silent: you've enabled it, the requests succeed, but the cache hit rate is zero because something in your prompt prefix is varying that you didn't notice. Always instrument and verify.
On Anthropic: log usage.cache_creation_input_tokens and usage.cache_read_input_tokens on every response. The first request in a session should write (large creation number, zero read). Subsequent requests should read (zero creation, large read). If you're seeing repeated writes, something at the start of your prompt is varying between requests — find it and move it to the end.
On OpenAI: log usage.prompt_tokens_details.cached_tokens on every response. After the first warm-up call, this should be close to your stable-prefix length. If it stays at zero, the most common cause is that something is varying in the first 1,024 tokens of your prompt — check for injected timestamps, user IDs, dynamic instructions, or randomized examples at the top.
One Last Tip: Tooling and Frameworks
Many higher-level frameworks (LangChain, LlamaIndex, the various agent libraries) handle caching transparently in their newer versions — they reorder prompts behind the scenes to maximize the cacheable prefix, and on Anthropic they inject the cache_control markers automatically. If you're using one of these, check the version and the documentation; you may already be getting the benefit without realizing it. Conversely, if you've written your own prompt assembly, you almost certainly have at least one variable creeping into the prefix — instrument and check the cache numbers before assuming.
FAQ
Hiring engineers who can think about cost and latency tradeoffs?
Browse AI and ML roles from companies that take engineering culture seriously — with verified work-life balance and team scores from people who actually work there.
Browse AI & ML Jobs → AI Skills Hub →