The short version

Anthropic: explicit caching via cache_control breakpoints. Write costs 1.25× input (5-min TTL) or 2.0× (1-hour TTL). Reads cost 0.10× input — a 90% discount. Minimum 1,024 tokens.

OpenAI: automatic caching, server-side, no code changes. Cache reads cost 0.5× input — a 50% discount. Minimum 1,024 tokens, growing in 128-token increments.

The mental model: caching pays for itself when the same long prefix is reused across many requests within the TTL window. Chatbots, agents, RAG, document Q&A — yes. One-shot transforms with unique inputs — no.

What Prompt Caching Actually Does

When a large language model serves a request, the most expensive thing it does — both in latency and in cost — is process the input tokens into the KV-cache that the attention layers operate on. For most real-world LLM applications, the same large chunks of input (system prompt, tool definitions, retrieved documents, conversation history) get reprocessed on every single request. That is wasted compute.

Prompt caching stores the processed state of repeated content so the provider can skip the reprocessing step. You pay a small premium on the first request to warm the cache, then a deep discount on every subsequent request that hits the same prefix. For workloads with stable reusable prefixes, this is the single largest cost lever available on either provider in 2026.

Anthropic vs OpenAI — The Philosophical Split

The two providers landed on opposite designs for the same problem. Both are defensible. The right choice depends on your workload.

| | Anthropic | OpenAI |
|---|---|---|
| How you enable it | Explicit. Mark blocks with cache_control: {"type": "ephemeral"} | Automatic. Zero code changes |
| Min prompt length | 1,024 tokens | 1,024 tokens |
| Cache write cost | 1.25× input (5-min TTL) or 2.0× (1-hour TTL) | Free |
| Cache read cost | 0.10× input (90% discount) | 0.5× input (50% discount) |
| TTL | 5 minutes or 1 hour (refreshes on hit) | Provider-managed (typically minutes; not configurable) |
| Cache granularity | Up to 4 explicit breakpoints; you choose what's cached | Longest matching prefix, in 128-token increments |
| Best for | High-volume workloads with stable, large prefixes; sustained traffic that keeps the cache warm | Anything with prompts over 1,024 tokens, zero engineering effort, sporadic traffic |

Anthropic's design assumes you know your workload and want maximum savings on it. OpenAI's design assumes you'd rather not think about it. If your traffic is high-frequency and your prompts are large and stable, Anthropic's 90% discount will dominate the comparison; if your traffic is sporadic or your prompts evolve frequently, OpenAI's automatic 50% with no write premium is the cleaner choice.

The Breakeven Math (When Caching Pays Off)

For Anthropic, the question is: does the discount on cache reads outweigh the premium on the first write?

With the 5-minute TTL (1.25× write, 0.10× read), the write premium is 0.25× the input price and each cache hit saves 0.90×, so a single hit more than recovers the premium. Every hit after that is pure savings.

With the 1-hour TTL (2.0× write, 0.10× read), the premium is 1.0× and each hit still saves 0.90×, so you need two cache hits to come out ahead. Use the 1-hour TTL when your requests are spaced widely enough that the 5-minute cache would expire between them, but frequently enough to hit at least twice within the hour.
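
The breakeven counts above can be checked with a few lines of arithmetic. This is a minimal sketch, not provider code: breakeven_hits is a hypothetical helper, and the multipliers are the ones quoted in this article, expressed relative to one un-cached pass over the prefix.

```python
def breakeven_hits(write_mult: float, read_mult: float, max_hits: int = 1000) -> int:
    """Smallest number of cache hits at which caching costs no more than not caching.

    All costs are in units of "one un-cached pass over the prefix".
    """
    for hits in range(max_hits + 1):
        cached = write_mult + hits * read_mult  # one cache write, then `hits` reads
        uncached = 1.0 * (hits + 1)             # every request pays full input price
        if cached <= uncached:
            return hits
    raise ValueError("caching never breaks even with these multipliers")


print(breakeven_hits(1.25, 0.10))  # 5-minute TTL -> 1
print(breakeven_hits(2.00, 0.10))  # 1-hour TTL   -> 2
```

Passing a write multiplier of 1.0 (no premium, as on OpenAI) returns 0, which matches the claim that there is no breakeven to compute when writes are free.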

Concretely: if you have a 50,000-token system-prompt-plus-tools-plus-documents block that's reused across an agent's session, caching turns the cost-per-request from "50,000 tokens at full input price" into "50,000 tokens at 10% of input price" for every request after the first. For agents that hit the model dozens of times per session, the cumulative savings on the long prefix are dramatic — often well over 80% of total spend on that workload.
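
To put rough numbers on that example, here is a small sketch of the prefix cost over one session. The 30-request session length is an assumption chosen for illustration, prefix_cost is a hypothetical helper, and the result covers only the shared prefix (not output tokens or the per-request suffix). It also assumes every request lands inside the TTL.

```python
def prefix_cost(prefix_tokens: int, requests: int,
                write_mult: float = 1.25, read_mult: float = 0.10) -> dict:
    """Token-equivalent cost of the shared prefix over one session."""
    if requests < 1:
        raise ValueError("requests must be at least 1")
    uncached = float(prefix_tokens * requests)
    # The first request writes the cache; the remaining ones read from it.
    cached = prefix_tokens * write_mult + prefix_tokens * (requests - 1) * read_mult
    return {"uncached": uncached, "cached": cached, "savings": 1 - cached / uncached}


session = prefix_cost(prefix_tokens=50_000, requests=30)
print(f"un-cached: {session['uncached']:,.0f} token-equivalents")
print(f"cached:    {session['cached']:,.0f} token-equivalents")
print(f"savings:   {session['savings']:.1%}")
```

Under these assumptions the prefix costs 207,500 token-equivalents instead of 1,500,000, a saving of about 86% on that part of the bill.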

For OpenAI, the math is simpler: cache writes are free, so any request over 1,024 tokens that re-uses a prefix is strictly better off. There's no breakeven calculation — just a 50% discount on the cached portion whenever you hit.

The Prompt Structure Rule (And the Mistake That Breaks Caches)

Caches always match the longest stable prefix of your prompt. The single rule that determines whether you get cache hits is:

The rule

Put stable content at the start of your prompt. Per-request content goes at the end. Anything that changes per request invalidates the cache for everything that comes after it.

The canonical order, top to bottom:

  1. System prompt — identical across all calls
  2. Tool definitions — identical across all calls
  3. Large reference documents — identical for the session/conversation
  4. Conversation history — grows but is append-only (each turn becomes part of the cacheable prefix on the next turn)
  5. User's current message — changes per request, goes last

The most common mistake is injecting per-request context at the top of the system prompt. "Today's date is 2026-06-18. The user's name is Jamie. The current page is /dashboard." Each of those varies per request and per user, and putting them at the top means the cache match fails on the very first character — you cache nothing.

The fix is to move per-request context to the end of the user message, or into a separate appended block. The 50,000-token system prompt and tool definitions above it stay stable across all users and all requests, and the cache hits.
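
The difference between the two layouts is easy to see by measuring how much of the prompt two different requests share. The sketch below is illustrative only: the prompt text, the function names, and the bracketed context format are all made up for the example, and it counts characters as a stand-in for tokens.

```python
import os

SYSTEM_PROMPT = "You are a support assistant. " * 200  # stand-in for a long, stable prompt


def build_prompt_bad(date: str, user: str, question: str) -> str:
    # Volatile values first: two requests diverge within the first few characters.
    return f"Today's date is {date}. The user's name is {user}.\n{SYSTEM_PROMPT}\n{question}"


def build_prompt_good(date: str, user: str, question: str) -> str:
    # Stable block first, per-request context appended at the very end.
    return f"{SYSTEM_PROMPT}\n{question}\n[context] date={date} user={user}"


def shared_prefix_len(a: str, b: str) -> int:
    """Length in characters of the prefix two prompts have in common."""
    return len(os.path.commonprefix([a, b]))


req_a = ("2026-06-18", "Jamie", "How do I export a report?")
req_b = ("2026-06-19", "Priya", "Where is the billing page?")

print(shared_prefix_len(build_prompt_bad(*req_a), build_prompt_bad(*req_b)))
print(shared_prefix_len(build_prompt_good(*req_a), build_prompt_good(*req_b)))
```

The first number is a couple of dozen characters; the second covers the whole system prompt, which is the part a prefix cache can reuse.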

Four Patterns Where Caching Is a Clear Win

1. Conversational chatbots with long system prompts

Any chatbot or assistant with a multi-thousand-token system prompt (instructions, tone, format rules, examples) is a textbook caching workload. The system prompt is identical across every user and every turn. The conversation history grows but is append-only. The new user message is small. Caching turns a steady-state cost into a tiny fraction of the un-cached cost.

2. Agentic loops with tool definitions

Agents that loop (reasoning, tool call, observation, repeat) reprocess the same tool definitions and system prompt on every iteration. A 10-iteration agent loop without caching is roughly 10× the input cost of the underlying prefix. With caching, it's one cache write plus nine cheap reads, roughly 2× the prefix cost on Anthropic's 5-minute TTL, plus the small incremental tokens that change per iteration. This is where the savings compound most aggressively.

3. RAG (Retrieval-Augmented Generation) with stable corpora

RAG patterns where the same document or set of documents is queried multiple times benefit enormously from caching. Cache the retrieved documents on the first query; subsequent queries against the same documents within the TTL pay only the small read cost.

Important: this only works when the retrieved set is stable across queries. If your RAG fetches different documents per query, the cache misses every time. Pin the retrieval to common docs (e.g., the same product manual being queried by different users) and the cache hits beautifully.

4. Document Q&A and code-aware assistants

Any pattern where a large document or codebase context is loaded into the prompt, then multiple questions are asked against it, is ideal. Cache the document on the first question; the subsequent questions read from cache. Code editors that load the open file plus dependencies, then handle multiple completions or chat turns against that context, see the largest savings.

Three Patterns Where Caching Doesn't Pay Off

  1. Short prompts. Anything under 1,024 tokens gets no caching at all on either provider. If your average request is short, optimize elsewhere.
  2. Per-request unique prefixes. If your system prompt or document set changes every call — e.g., personalized prompts that include the user's full profile inline at the top — there's nothing to cache. Restructure the prompt (per-request data goes last) or accept that this workload isn't a caching fit.
  3. Very low-frequency workloads on Anthropic. If the next request arrives more than 5 minutes after the previous one (or more than 1 hour with the long TTL), the cache has expired and you've paid the write premium for nothing. OpenAI's automatic caching avoids this because writes are free. If your traffic is genuinely sporadic, either skip cache_control on Anthropic and send un-cached calls, or rely on OpenAI's automatic 50% discount.
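
The third case can be turned into a rough decision rule for Anthropic workloads. This is a sketch under simplifying assumptions, not official guidance: suggest_ttl is a hypothetical helper, it treats the gap between requests as constant, and it uses the breakeven counts from this article (one hit for the 5-minute TTL, two for the 1-hour TTL).

```python
from typing import Optional

FIVE_MINUTES = 5 * 60
ONE_HOUR = 60 * 60


def suggest_ttl(typical_gap_seconds: float, requests_in_session: int) -> Optional[str]:
    """Return "5m", "1h", or None (skip caching) for an Anthropic workload."""
    hits = requests_in_session - 1  # the first request only writes the cache
    if typical_gap_seconds < FIVE_MINUTES and hits >= 1:
        return "5m"   # each hit refreshes the TTL, so the cache stays warm
    if typical_gap_seconds < ONE_HOUR and hits >= 2:
        return "1h"   # 2.0x write needs two hits to pay for itself
    return None       # cache would expire, or too few hits to recover the premium


print(suggest_ttl(30, 20))     # chat-style traffic
print(suggest_ttl(900, 4))     # a request every 15 minutes
print(suggest_ttl(7200, 10))   # a request every 2 hours
```

Real traffic is bursty, so treat the output as a starting point and confirm it against the cache usage numbers your responses report.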

Implementing It on Anthropic

You mark up to four cache breakpoints in your request. Anything from the start of the prompt up to and including a breakpoint becomes cacheable. The typical pattern: one breakpoint at the end of the system prompt + tools block, and (for long conversations) one at the end of the prior conversation history.

    # Python SDK example: one cache breakpoint at the end of the system block
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": large_system_prompt_plus_docs,   # >1,024 tokens
                "cache_control": {"type": "ephemeral"},  # ← cache breakpoint
            }
        ],
        messages=[{"role": "user", "content": user_message}],
    )

For the 1-hour TTL (instead of the default 5-minute), pass "cache_control": {"type": "ephemeral", "ttl": "1h"}. The response object includes usage.cache_creation_input_tokens and usage.cache_read_input_tokens so you can verify whether the cache wrote, hit, or both.
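
For long conversations, the two-breakpoint pattern described above (one after the system block, one at the end of the prior history) might be assembled like this. It is a sketch: build_cached_request is a hypothetical helper that only builds the request arguments, and placing cache_control on the last content block of the final history message is an assumption to confirm against the current SDK documentation.

```python
import copy


def build_cached_request(system_text, history, user_message, model="claude-sonnet-4-6"):
    """Build keyword arguments for client.messages.create with two cache breakpoints."""
    marker = {"type": "ephemeral"}
    messages = copy.deepcopy(history)  # never mutate the caller's history
    if messages:
        last = messages[-1]
        if isinstance(last["content"], str):
            # Convert plain-string content to block form so it can carry a marker.
            last["content"] = [{"type": "text", "text": last["content"]}]
        # Breakpoint 2: everything up to the end of the prior history.
        last["content"][-1]["cache_control"] = dict(marker)
    messages.append({"role": "user", "content": user_message})  # changes per request
    return {
        "model": model,
        "max_tokens": 1024,
        # Breakpoint 1: the stable system block.
        "system": [{"type": "text", "text": system_text, "cache_control": dict(marker)}],
        "messages": messages,
    }


history = [
    {"role": "user", "content": "Summarize the manual."},
    {"role": "assistant", "content": "Here is a summary."},
]
request = build_cached_request("long stable system prompt", history, "What about section 3?")
print(request["messages"][1]["content"])
```

The result can be passed as client.messages.create(**request). On the next turn, call the helper again with the extended history so the breakpoint moves forward with the conversation.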

Implementing It on OpenAI

There's nothing to implement. Caching is automatic on supported models for any prompt over 1,024 tokens. The only thing you need to do is structure your prompts correctly (stable content first, per-request content last) so the longest matching prefix is large enough to be worth caching.

The response includes a usage.prompt_tokens_details.cached_tokens field that tells you how many of the input tokens hit the cache on that call. If your prompts are well-structured, you should see this number climb after the first warm-up call and stay high.
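
To know roughly what value to expect in that field, you can estimate the cacheable length from the size of your stable prefix. This sketch assumes the cached length rounds down to the nearest 128-token step above the 1,024-token minimum, which is one reading of the "128-token increments" rule; expected_cached_tokens is a hypothetical helper, not part of any SDK.

```python
MIN_CACHEABLE = 1024  # minimum prompt length for caching
INCREMENT = 128       # cached length grows in steps of this size


def expected_cached_tokens(stable_prefix_tokens: int) -> int:
    """Estimate how many prefix tokens can be served from cache."""
    if stable_prefix_tokens < MIN_CACHEABLE:
        return 0
    steps = (stable_prefix_tokens - MIN_CACHEABLE) // INCREMENT
    return MIN_CACHEABLE + steps * INCREMENT


for prefix in (1000, 1100, 5000):
    print(prefix, expected_cached_tokens(prefix))
```

A 5,000-token stable prefix comes out at 4,992 cacheable tokens under this assumption, so a reported value slightly below your prefix length is expected rather than a sign of a problem.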

How to Verify Your Cache Is Actually Hitting

The most common failure mode for caching is silent: you've enabled it, the requests succeed, but the cache hit rate is zero because something in your prompt prefix is varying that you didn't notice. Always instrument and verify.

On Anthropic: log usage.cache_creation_input_tokens and usage.cache_read_input_tokens on every response. The first request in a session should write (large creation number, zero read). Subsequent requests should read (zero creation, large read). If you're seeing repeated writes, something at the start of your prompt is varying between requests — find it and move it to the end.

On OpenAI: log usage.prompt_tokens_details.cached_tokens on every response. After the first warm-up call, this should be close to your stable-prefix length. If it stays at zero, the most common cause is that something is varying in the first 1,024 tokens of your prompt — check for injected timestamps, user IDs, dynamic instructions, or randomized examples at the top.
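
Both checks can be wrapped in small logging helpers. The sketch below reads the usage fields named above with getattr, so it runs against stand-in objects here; the helper names and status labels are made up for illustration, and prompt_tokens is assumed to be the total input count on the OpenAI usage object.

```python
from types import SimpleNamespace


def anthropic_cache_status(usage) -> str:
    """Classify an Anthropic response as a cache write, a hit, both, or neither."""
    created = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    if read and created:
        return "hit+write"  # read the old prefix and extended the cache
    if read:
        return "hit"
    if created:
        return "write"
    return "none"           # nothing cached: prompt too short or no breakpoint


def openai_cached_fraction(usage) -> float:
    """Share of input tokens served from cache on an OpenAI response."""
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) or 0
    total = getattr(usage, "prompt_tokens", 0) or 0
    return cached / total if total else 0.0


# Stand-in usage objects shaped like the fields described above.
first = SimpleNamespace(cache_creation_input_tokens=50_000, cache_read_input_tokens=0)
later = SimpleNamespace(cache_creation_input_tokens=0, cache_read_input_tokens=50_000)
print(anthropic_cache_status(first), anthropic_cache_status(later))

oa = SimpleNamespace(prompt_tokens=5_200,
                     prompt_tokens_details=SimpleNamespace(cached_tokens=4_992))
print(f"{openai_cached_fraction(oa):.0%}")
```

Logging these two values per request and alerting when the status stays at "write" or the fraction stays near zero catches most silent cache misses.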

One Last Tip: Tooling and Frameworks

Some higher-level frameworks (LangChain, LlamaIndex, the various agent libraries) have added caching support in their newer versions, and some can place Anthropic's cache_control markers for you. If you're using one of these, check the version and the documentation; you may already be getting the benefit without realizing it. Conversely, if you've written your own prompt assembly, you almost certainly have at least one variable creeping into the prefix, so instrument and check the cache numbers before assuming.

FAQ

What is prompt caching?
Prompt caching stores the processed state of repeated content in a prompt (system instructions, large documents, tool definitions) so subsequent requests can read it from cache instead of reprocessing it from scratch. Anthropic charges a small premium to write the cache, while OpenAI writes are free; both give a deep discount (50% on OpenAI, 90% on Anthropic) on every read within the cache's TTL.

What's the minimum prompt length for caching?
1,024 tokens on both Anthropic and OpenAI. Prompts shorter than that don't get cached at all. On OpenAI, the cached prefix then grows in 128-token increments beyond that minimum.

How does Anthropic's prompt caching pricing work?
Anthropic charges 1.25× the standard input token price to write to a 5-minute TTL cache, or 2.0× to write to a 1-hour TTL cache. Every read from the cache within the TTL costs only 0.10× the standard input price, a 90% discount.

How does OpenAI's prompt caching pricing work?
OpenAI's prompt caching is automatic and free to write. Cached input tokens are billed at 50% of the standard input rate on GPT-4o and supported newer models. The cache kicks in automatically for any prompt over 1,024 tokens, growing in 128-token increments.

When does prompt caching NOT pay off?
Three cases. (1) Short prompts: anything under 1,024 tokens gets no caching at all. (2) Workloads where the prompt prefix changes every request. (3) Very low-frequency workloads on Anthropic: if requests are spaced further apart than the TTL, the write premium becomes pure overhead. OpenAI's automatic caching avoids this case since writes are free.

Do I have to change my code to use prompt caching?
On OpenAI: no, caching is fully automatic. On Anthropic: yes, you opt in by marking which blocks of your prompt you want cached via cache_control breakpoints. Anthropic gives you a deeper discount and explicit control; OpenAI gives you smaller savings with zero engineering effort.

What's the best prompt structure for caching?
Put the stable, reusable content at the start: system prompt → tool definitions → large reference documents → conversation history → user's current message. Caching always matches the longest stable prefix, so anything that changes per request must come last.
