Prompt caching lets you pay a discounted rate for input tokens the provider has recently processed for you. If your system prompt, tool definitions, or document context is large and reused across requests, caching typically cuts input token costs by 60–90% and reduces first-token latency. The trick is architecting your prompt so the reusable portion sits at the very top, then explicitly placing cache breakpoints (on Anthropic/Bedrock) or letting automatic prefix matching kick in (on OpenAI/Gemini/DeepSeek).
If you run any non-trivial LLM application in production, prompt caching is probably the single largest cost lever you have. For agentic systems with long tool definitions, RAG applications with big context windows, or coding assistants that pass the same codebase context on every turn, well-designed caching turns runaway API bills into background noise. And yet a surprising number of teams either don't use it at all or use it wrong.
This piece walks through what prompt caching actually is at the mechanical level, when it pays off, how the major providers implement it, and the specific prompt-design patterns that maximize cache hits. If you've been treating prompt caching as a nice-to-have or a "we'll get to it later" line item, this is the guide to change your mind.
What prompt caching actually is
When a transformer processes a prompt, it produces a large internal state — commonly known as the KV cache — that represents "everything the model has already thought about" as it prepares to generate the next token. That state is expensive to compute; for long prompts, it dominates both the latency and the cost of a request.
Prompt caching is a provider-side optimization: when the provider sees a request whose prefix matches a prefix it has already processed recently, it can reuse the stored KV cache instead of recomputing it from scratch. The result is (a) a much lower per-token bill for the cached portion, and (b) a much lower latency to first token, because the model doesn't have to re-ingest the beginning of the prompt.
The mental model is simple: the top of your prompt is expensive to compute the first time and nearly free after that, as long as (i) the same prefix is reused, (ii) it happens within a short time window, and (iii) the provider actually has capacity to hold the cached state. Everything below the reused prefix — the bit that changes per request — still gets processed normally.
When prompt caching pays off (and when it doesn't)
Caching provides real value when the ratio of "reused tokens" to "unique tokens per request" is high. It provides very little value when almost every request has a novel prompt.
High-leverage cases — these are where caching is a no-brainer:
- Long system prompt with tool definitions. Modern agentic systems often have 5,000–20,000 tokens of tool schemas, examples, and behavioral guidance at the top of every call. That prefix is identical across turns and users.
- Document QA and RAG on stable documents. If a user is chatting with a long PDF, contract, or codebase, the document itself is a giant reusable prefix.
- Coding assistants. The same file, module, or project context is passed on every turn of a coding session. Caching the codebase prefix is one of the biggest reasons Cursor, Copilot-style products, and Claude Code can offer subscription pricing at scale.
- Customer support agents. Static knowledge base, personality definition, and conversation history all sit in a reusable block.
- Few-shot prompts with many exemplars. Anything with 3+ worked examples in the prefix.
Low-leverage cases — where caching might cost more than it saves:
- Short one-shot completions. If your entire prompt is 200 tokens, there's nothing to cache.
- Prompts where the top changes per request. If you inject the user's name, current date, or query at the very start, you've invalidated the cache before it can help you.
- Very low request volume. Some providers charge a modest premium on the first "cache write." If a prefix is used only once or twice in the TTL window, that write cost may exceed the read savings.
How the major providers approach it in 2026
Every serious provider now supports some form of prompt caching, but the mechanics differ. Here's how the current landscape shakes out.
| Provider | How it works | Notes |
|---|---|---|
| Anthropic (Claude) | Explicit: you place cache_control markers on message blocks. Ephemeral cache with 5-min default TTL; optional 1-hour tier at higher cost. |
Cache reads at ~10% of standard input rate. Cache writes typically at a modest premium. Explicit control means you can precisely choose which portions of the prompt to cache. |
| OpenAI | Automatic prefix matching. No explicit markers required. Cache is best-effort with short TTL. | Cached input tokens billed at a fraction of standard input rate. Simpler to adopt; less control over what gets cached. |
| Google Gemini | Two modes: implicit (automatic, short TTL) and explicit (context caching API with configurable TTL). | Explicit context caching is well-suited for long-lived reference documents used across many requests. |
| Amazon Bedrock | Explicit cache markers for supported models (including Claude and Nova families). | Behavior largely mirrors the underlying model provider's spec. |
| DeepSeek | Automatic prefix caching for the API. | Popular for its aggressive cache-hit pricing; strong choice for high-volume applications with stable prefixes. |
The pricing structure varies but the pattern is consistent: cache-hit input tokens are billed at a small fraction of standard input tokens. For long stable prefixes, this dominates the total cost calculation.
The core rule: order your prompt from most stable to least stable
The single most important design pattern is to structure every prompt so that the parts that never change go at the top, the parts that occasionally change go in the middle, and the parts that change every request go at the bottom. In agentic systems, that ordering typically looks like:
- System prompt (the assistant's persona, high-level guidance)
- Tool definitions (JSON schemas for the tools the model can call)
- Long stable context (documents, codebase, knowledge base)
- Few-shot examples (if used, and if stable)
- Conversation history up to the previous turn
- The current user message
If you invert this ordering — for example, if you put the current user query at the top for "clarity" — you've defeated caching entirely. Every request will see a different first token and hit no cache.
Concrete example: caching in the Anthropic SDK
Here's what an explicit cache breakpoint looks like on the Anthropic API. The cache_control marker tells the model "everything up to and including this block is reusable." All subsequent messages in the same conversation will match against the cached prefix.
The cost impact is easy to observe: the response object returns cache_creation_input_tokens and cache_read_input_tokens. On the first request, you'll see a nonzero cache_creation value; on subsequent requests within the TTL, you'll see cache_read take the place of standard input tokens at a fraction of the price.
For applications integrated through the Vercel AI Gateway or a similar routing layer, you can often get automatic caching benefits without changing your application code — the gateway takes care of the provider-specific mechanics.
Common pitfalls (and how to avoid them)
1. Injecting dynamic content near the top
The most common failure mode: a well-meaning developer adds "The current date is {today}" to the top of the system prompt for freshness. Because the date changes every day (or every request), the cache is invalidated. Move dynamic content to the bottom of the prompt, or use dedicated fields that the provider knows to exclude from cache matching.
2. Treating the cache as durable
Provider caches are ephemeral. The default TTL on most providers is a few minutes. If your application makes one request every ten minutes to the same endpoint, you may see no cache hits at all. Design your workload assuming a cold cache is normal; use caching for latency-sensitive high-frequency paths, not for once-a-day batch jobs.
3. Not measuring cache hit rate
Every observable API returns per-request cache statistics. Log them. Watch the ratio of cache reads to cache writes over time. If your hit rate is below 50%, something in your prompt structure is silently invalidating the cache. This is the most common source of confusion — developers deploy caching, don't measure it, and assume it's working. It often isn't.
4. Placing the breakpoint too aggressively
On providers with explicit cache markers (Anthropic, Bedrock), a common mistake is to add a breakpoint at every possible position "just in case." Each cache breakpoint has an associated write cost. Placing them thoughtfully — usually one after tool definitions, one after the long stable context — is better than sprinkling them everywhere.
5. Cache size limits
Providers cache above a minimum token threshold (usually around 1,024 tokens on Anthropic; other providers have their own floors). If your prefix is short, caching won't activate at all. Consolidate rather than splitting into many small blocks.
6. Prompt-tinkering drift
Over the course of a week, developers make small edits to the system prompt — a comma here, a word there. Each edit invalidates the cache. Treat the "stable" prefix as a versioned artifact. Update it deliberately, test the new version, and accept that each update triggers a fresh cache warmup.
Advanced patterns
Cache-friendly retrieval
In RAG systems, the naive pattern is to retrieve the top-K documents fresh on every user turn and inject them into the prompt. This makes every prompt unique and defeats caching. A cache-friendly pattern is to batch the retrieved context at the conversation level: pull the relevant documents once for the whole session, place them in the reusable prefix, and let the model reference them across turns. You'll pay for slightly more tokens per turn, but the cache-hit savings usually swamp the extra token cost.
Two-tier caching for long documents
If you have very long stable documents that dominate the prefix (say, a 100-page contract that a user is asking questions about), and the provider offers a longer-TTL cache tier (Anthropic's 1-hour cache, Gemini's explicit context cache), the math typically tips in favor of paying the higher write cost once and reading the cache many times over the session.
Segmenting by role
When you run different agents (e.g., a "researcher" and a "writer") in a pipeline, they often share substantial common context but have distinct prompts. If you can split the shared context into its own cached block and then append role-specific instructions, both agents get cache hits on the shared portion.
Warming the cache
For latency-sensitive user-facing applications, you can proactively make a "warmup" request just before an expected burst of traffic (e.g., a scheduled meeting, a batch job) so the first user-facing request already hits a warm cache. This is a niche technique but powerful for consumer chat applications with predictable spikes.
How much can you actually save?
The real savings depend heavily on your ratio of "reused prefix tokens" to "unique per-request tokens." A useful back-of-envelope:
- Small prefix, small variable tail (~500 / ~500 tokens): Caching saves modest amounts on the input side. Latency benefits still visible.
- Large prefix, small variable tail (~20,000 / ~500 tokens): Input cost drops dramatically — often 80%+ on the input line item. This is the sweet spot for agentic systems and RAG.
- Long conversation with growing history (~30,000 / ~2,000 tokens): Massive savings, especially with tiered TTL. Coding assistants live here.
For a real application, model your prompt structure on paper first: total tokens in the stable prefix, expected variable tokens per turn, expected turns per user session, expected sessions per minute. The math falls out almost immediately from there.
A production checklist
Before you ship prompt caching
- Reorder your prompt. Stable content at the top, dynamic content at the bottom. Every dynamic injection in the prefix is a cache miss waiting to happen.
- Freeze your system prompt as a versioned artifact. No more silent edits. Version it, deploy it, monitor cache hit rate as a metric.
- Instrument cache statistics. Log
cache_readandcache_creationtokens per request. Add a dashboard for cache hit rate. - Choose your breakpoints deliberately. Usually one after tool definitions, one after long stable context. Not one everywhere.
- Consider TTL tier. Ephemeral for short bursts, longer TTL for stable reference documents used across sessions.
- Test with a cold cache. Your worst-case latency is a cache miss. Make sure the app is still usable in that state.
- Model expected cost. Estimate savings vs write premium. Confirm the math is positive for your workload before rolling out.
- Watch for prompt drift. A "small tweak" to the system prompt can silently drop your cache hit rate. Alert on regressions.
Why this matters for hiring in AI
Prompt caching is one of the diagnostic questions that separates "has actually shipped an LLM app" from "has read a few articles." When you're interviewing AI engineers, ask them how they'd cut a $50k/mo LLM bill in half. If they don't mention caching in the first thirty seconds, they haven't built at scale. If they can explain the specific tradeoffs — write cost vs read savings, TTL selection, prefix ordering, hit-rate observability — they've done the work.
For AI engineers building this expertise: production-grade prompt caching, together with model routing, structured output design, and evaluation harnesses, is the specific bundle of skills that separates AI engineers from ML engineers in the modern job market. Companies hiring for AI infrastructure and LLMOps roles screen for exactly this kind of production judgment.
Browse LLM engineering & ML infra roles
Companies building AI-native products need engineers who understand the production reality of prompt caching, evaluation, model routing, and cost optimization. See who's hiring.
Browse AI/ML Jobs → Explore AI Skills Content →