What is prompt caching? Why it matters for your business

TL;DR

Prompt caching stores the processed state of the unchanging prefix of your prompt, typically a system prompt or policy document, and reuses it on later requests at roughly 10% of normal input cost. For long, stable prefixes used many times within a short window, savings of 60 to 80% on input cost are routine. For short or highly variable prompts, it adds complexity without meaningful benefit.

Key takeaways

- Prompt caching stores the processed view of the unchanging prefix of your prompt and reuses it at roughly 10% of standard input cost.
- It only activates above a vendor minimum, typically 1,024 tokens on OpenAI and Anthropic, 2,048 to 4,096 on Google.
- Default cache lifetime is 5 minutes on the major vendors. Sparse traffic that does not cluster in that window pays write premiums without recovering them.
- Caching also reduces time-to-first-token noticeably, often 13 to 31%, which can matter more than cost for chat-style applications.
- The decision rule is workload shape, not vendor pitch. Long stable prefix plus high repeat rate equals enable. Short or variable prompts equal skip.

A 30-staff professional services firm I worked with earlier this year had built an internal AI assistant on top of a 15,000-token policy and procedures document. Every query the team typed loaded the full policy as a system prompt so the answers stayed consistent with how the firm actually operated. About 200 queries a day, five days a week. On Claude Sonnet at standard input rates, the system-prompt portion alone was running close to £160 a month, and the bill was growing as adoption spread.

The fix was a setting they did not know existed. Once prompt caching was switched on and the prompt was structured so the policy sat at the front, the same workload cost roughly £16 a month. Same assistant. Same answers. Ten times less spend on the cached portion.

That is the lever. It is worth understanding properly because the autopilot version helps a little, and the engineered version helps a lot more.

What is prompt caching?

Prompt caching stores the model’s processed view of the unchanging prefix of your prompt, typically a system prompt, policy document, codebase, or any fixed context, and reuses it on subsequent requests within a short window. On the cached portion, the vendor bills you at roughly 10% of standard input cost. The user-specific part of the prompt, the actual question, is still processed fresh each time.

Technically, what gets stored are the key and value matrices the model computes in its attention layers, the “KV cache” you will see in vendor documentation. Those matrices are expensive to compute for a long prompt. Caching skips the recomputation when the same prefix arrives again within the TTL, which defaults to 5 minutes on the major vendors.

From your side as an owner, the practical move is structural. Put the things that do not change first, the system prompt, the policy, the FAQ. Put the things that do change last, the user’s question, the timestamp, the session ID. If anything in the first 1,024 tokens shifts between requests, the cache misses entirely and you pay full price.
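As a rough illustration, here is what that structure looks like with Anthropic's Python SDK. The model name and policy file are placeholders, not a recommendation, and the cache_control marker is what tells the API where the stable prefix ends; OpenAI does the equivalent automatically once the prefix clears its minimum.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Stable prefix: the policy document that never changes between requests.
# Placeholder path; in practice this is your 15,000-token policy.
POLICY_TEXT = open("policy.md").read()

def ask(question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-5",  # illustrative model name
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": POLICY_TEXT,
                "cache_control": {"type": "ephemeral"},  # marks the cache boundary
            }
        ],
        # Variable part goes last: just the user's question, nothing else.
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text
```

Everything above the cache_control marker must be byte-identical from one request to the next, which is why the question, and anything else that varies, lives in the messages list rather than the system block.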

Why does it matter for your business?

For any owner running an AI feature on a long, stable prompt at any volume, prompt caching is the largest single cost lever available in 2026. The arithmetic is unforgiving. A 12,000-token policy document used 200 times in a 5-minute window costs about £7 in input tokens at standard Claude Sonnet rates. With caching enabled, the same workload costs about 75p. Roughly a 10x reduction on the cached portion.
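For anyone who wants to check the arithmetic, a back-of-envelope version looks like this. The rates are illustrative placeholders, not a quote; substitute your vendor's current per-million-token price, write premium, and read discount.

```python
# Back-of-envelope: a 12,000-token prefix used 200 times inside one cache window.
# All rates are illustrative placeholders; check your vendor's pricing page.
PREFIX_TOKENS = 12_000
CALLS = 200
INPUT_RATE = 3.00 / 1_000_000   # price per input token (example: 3 currency units per million)
WRITE_PREMIUM = 1.25            # the first call writes the cache at a premium
READ_DISCOUNT = 0.10            # later calls read it at roughly 10% of the input rate

uncached = PREFIX_TOKENS * CALLS * INPUT_RATE
cached = PREFIX_TOKENS * INPUT_RATE * (WRITE_PREMIUM + READ_DISCOUNT * (CALLS - 1))

print(f"uncached prefix cost: {uncached:.2f}")  # ~7.20
print(f"cached prefix cost:   {cached:.2f}")    # ~0.76, roughly a 10x reduction
```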

The savings scale with three things: prompt length, repetition rate, and how tightly your queries cluster in time. Project Discovery, a security firm, reported a 59% reduction in LLM spend after implementing caching, rising to 66% once they tuned prompt structure for cache stability. That is real money for a system already in production.

There is a second benefit that often matters more than cost for chat-style applications. Cached reads also reduce time-to-first-token, the delay before the model starts replying, by 13 to 31% in independent measurements. For a customer-facing assistant, that is the difference between a 3-second wait and a snappy response.

The owner who does not know caching exists pays the full bill. The owner who knows but treats it as a setting to flip discovers, on a slow day when queries are sparse, that the cache expires between calls and the discount evaporates.

Where will you actually meet it?

You will meet prompt caching in three places, and they look different from each other. The first is your vendor’s pricing page or usage dashboard. Anthropic, OpenAI, Google, and AWS Bedrock all expose cache-write tokens and cache-read tokens as separate line items, billed at different rates. If your dashboard shows zero cached tokens after a few weeks of production use, your prompts are either too short, too variable, or structured wrong.

The second is the API itself. OpenAI applies caching automatically once your prompt clears 1,024 tokens. Anthropic requires you to mark the cache boundary explicitly with a cache_control parameter on the cached block. Google Gemini offers both implicit and explicit caching, with hourly storage pricing on the explicit version for long-running workflows. The mechanics differ, and a vendor that cannot tell you their minimum, their TTL, their write premium, and their read discount is not yet giving you the information you need.
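If you want to confirm the cache is actually firing rather than trusting the dashboard, both major APIs report it in the usage object on every response. A minimal sketch, assuming the official Python SDKs and requests you have already made:

```python
# Anthropic: cache activity appears as separate usage fields on each response.
# Assumes `response` came from client.messages.create with a cache_control block.
print(response.usage.cache_creation_input_tokens)  # tokens written to the cache
print(response.usage.cache_read_input_tokens)      # tokens served from the cache

# OpenAI: caching is automatic above the minimum; cached tokens are reported
# inside prompt_tokens_details on a chat completion response.
print(completion.usage.prompt_tokens_details.cached_tokens)
```

If the read figures stay at zero across a busy afternoon, the prefix is changing between requests somewhere, which is the next point.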

The third place is the design of your application. Caching only works if the unchanging part of the prompt actually stays unchanged. A timestamp in the system prompt, a user ID near the top, a tool definition reordered between requests, any of these break the cache silently. Building applications that maintain stable prefixes is a design discipline rather than a runtime tweak. Once it is in place, the savings compound; before it is in place, the feature is invisible.
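A concrete illustration of the trap, reusing the placeholder POLICY_TEXT from the earlier sketch and a hypothetical question variable: the two versions below differ only in where the timestamp lives, and that difference decides whether the prefix can ever be reused.

```python
from datetime import datetime, timezone

now = datetime.now(timezone.utc).isoformat()

# Breaks the cache: the timestamp changes on every request, so the prefix
# is never byte-identical and every call pays full price.
system_prompt_bad = f"Current time: {now}\n\n{POLICY_TEXT}"

# Cache-friendly: the prefix is only the unchanging policy; anything volatile
# (timestamp, user ID, session data) travels with the user turn instead.
system_prompt_good = POLICY_TEXT
user_message = f"[asked at {now}] {question}"
```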

When to ask about it, when to ignore it

Ask about caching when you have a long stable prefix, 1,024 tokens or more, used many times within a short window. Internal assistants on a policy document, support bots with fixed tone and product context, code review tools that resend the same repository context, all qualify. In those workloads, caching is the closest thing to free money the vendor offers, because it directly reduces their compute costs as well as your bill.

Ignore caching when any of three conditions are true. If your prompts are under the vendor minimum, typically 1,024 tokens, the feature simply does not activate. If your context changes every request, fresh document per query, no overlap, there is nothing to cache. And if your traffic is sparse, a few queries spread across the day, the 5-minute TTL expires between calls and you pay the write premium without recovering it in reads. A solo consultant making three or four queries a day will spend more with caching enabled than without.
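The break-even logic behind that last point is simple enough to sketch. With an illustrative 25% write premium and 90% read discount, a single cache hit inside the TTL already recovers the write; zero hits means every call costs a quarter more than it needed to.

```python
# How many cache reads are needed to recover the write premium?
# Rates are illustrative; substitute your vendor's actual numbers.
WRITE_PREMIUM = 0.25  # extra cost of a cache write versus a normal input token
READ_SAVING = 0.90    # saving on each cache read versus a normal input token

for reads_within_ttl in range(0, 4):
    extra = WRITE_PREMIUM - READ_SAVING * reads_within_ttl
    verdict = "loses money" if extra > 0 else "saves money"
    print(f"{reads_within_ttl} reads inside the TTL: {verdict} ({-extra:+.2f}x per prefix token)")
```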

The question to put to any vendor pitching an AI feature is direct: what is your minimum cache size, what is the TTL, what is the write premium, what is the read discount, and what causes a cache miss? If the answers are vague, you are not making an informed cost decision yet.

Tokens are the units AI vendors charge by, roughly three-quarters of a word in English. A prompt is just a sequence of tokens, and caching is a discount on a portion of that sequence. The post on what is a token covers how the count actually works.

Input and output tokens are billed separately, and input usually dominates the bill on long-prompt workloads. That is why caching, which only discounts input, can move the headline cost so dramatically. Output tokens are not affected by caching at all.

Context window is the maximum number of tokens the model can process in a single request, currently 200,000 on Claude and around 1 million on Gemini Pro. Caching does not extend the window, it just reduces the cost of using a large portion of it repeatedly.

Batch API is a separate cost lever, available on OpenAI and Anthropic, that processes requests asynchronously at a flat 50% discount on both input and output, with a 24-hour turnaround. It is the right answer for non-urgent work like nightly report generation. Caching is for interactive systems where latency matters. Both can stack, though the marginal gain of layering them is small.

Fine-tuning is the deeper, slower customisation lever, retraining the model itself on your examples. It is more expensive and slower to update than caching and only earns its keep at high volume. The post on fine-tuning covers the trade-off in more detail.

If your AI bill is creeping up and you have a long stable prompt anywhere in the stack, prompt caching is the first place to look. If it is not, it is not.

Sources

Anthropic (2026). Prompt caching documentation including cache_control mechanics, write premium, read discount, and TTL options. https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching

Anthropic (2026). API pricing page with cache-write and cache-read line items per model. https://www.anthropic.com/api/pricing

OpenAI (2026). Prompt caching guide covering automatic caching mechanics and minimum thresholds. https://platform.openai.com/docs/guides/prompt-caching

Google (2026). Gemini context caching documentation, both implicit and explicit caching. https://ai.google.dev/gemini-api/docs/caching

AWS (2026). Bedrock prompt caching documentation with cost-and-usage breakdown for cache reads and writes. https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-caching.html

Project Discovery (2024). How we cut LLM cost with prompt caching, real-world before-and-after numbers showing a 59% reduction. https://projectdiscovery.io/blog/how-we-cut-llm-cost-with-prompt-caching

SSW (2025). Best practices for prompt caching including prefix stability and cache invalidation behaviour. https://www.ssw.com.au/rules/do-you-follow-best-practices-for-prompt-caching

Tian Pan (2026). Prompt cache break-even math, working out how many reads are needed to recover the write premium. https://tianpan.co/blog/2026-04-17-prompt-cache-break-even-math

Sebastian Raschka (2024). Coding the KV cache in LLMs, the technical underpinning that prompt caching productises. https://magazine.sebastianraschka.com/p/coding-the-kv-cache-in-llms

Frequently asked questions

How is prompt caching different from RAG?

RAG retrieves relevant documents at query time and passes them into the prompt. Prompt caching stores the model's processed view of the unchanging prefix of any prompt, retrieved or otherwise. The two are complementary. A RAG pipeline can use caching for the stable system prompt and tool definitions while the retrieved documents change per query.

Will prompt caching change the answers my AI gives?

No. Caching only stores the model's intermediate calculations on the unchanging prefix of the prompt. The model still generates fresh output for every request, conditioned on the full prompt including the user's specific question. You get the same answers, just faster and at lower cost on the cached portion. Caching does not alter the computation itself, so a cached run behaves exactly as an uncached run would with the same inputs.

What happens to my data inside a cached prefix?

The cached state lives on the vendor's infrastructure for the duration of the TTL, default 5 minutes, optionally up to 1 hour or longer. It is keyed by a hash of your prompt and not shared across organisations. For regulated data, treat extended retention as a compliance question and review your vendor's data residency policy before enabling it.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
