What is prompt caching? Why it matters for your business

TL;DR

Prompt caching stores the processed state of the unchanging prefix of your prompt, typically a system prompt or policy document, and reuses it on later requests at roughly 10% of normal input cost. For long, stable prefixes used many times within a short window, savings of 60 to 80% on input cost are routine. For short or highly variable prompts, it adds complexity without meaningful benefit.

Key takeaways

- Prompt caching stores the processed view of the unchanging prefix of your prompt and reuses it at roughly 10% of standard input cost.
- It only activates above a vendor minimum, typically 1,024 tokens on OpenAI and Anthropic, 2,048 to 4,096 on Google.
- Default cache lifetime is 5 minutes on the major vendors. Sparse traffic that does not cluster in that window pays write premiums without recovering them.
- Caching also reduces time-to-first-token noticeably, often 13 to 31%, which can matter more than cost for chat-style applications.
- The decision rule is workload shape, not vendor pitch. Long stable prefix plus high repeat rate equals enable. Short or variable prompts equal skip.

A 30-staff professional services firm I worked with earlier this year had built an internal AI assistant on top of a 15,000-token policy and procedures document. Every query the team typed loaded the full policy as a system prompt so the answers stayed consistent with how the firm actually operated. About 200 queries a day, five days a week. On Claude Sonnet at standard input rates, the system-prompt portion alone was running close to £160 a month, and the bill was growing as adoption spread.

The fix was a setting they did not know existed. Once prompt caching was switched on and the prompt was structured so the policy sat at the front, the same workload cost roughly £16 a month. Same assistant. Same answers. Ten times less spend on the cached portion.

That is the lever. It is worth understanding properly because the autopilot version helps a little, and the engineered version helps a lot more.

What is prompt caching?

Prompt caching stores the model’s processed view of the unchanging prefix of your prompt, typically a system prompt, policy document, codebase, or any fixed context, and reuses it on subsequent requests within a short window. On the cached portion, the vendor bills you at roughly 10% of standard input cost. The user-specific part of the prompt, the actual question, is still processed fresh each time.

Technically, what gets stored are the key and value matrices the model computes in its attention layers, the “KV cache” you will see in vendor documentation. Those matrices are expensive to compute for a long prompt. Caching skips the recomputation when the same prefix arrives again within the TTL, which defaults to 5 minutes on the major vendors.

From your side as an owner, the practical move is structural. Put the things that do not change first, the system prompt, the policy, the FAQ. Put the things that do change last, the user’s question, the timestamp, the session ID. If anything in the first 1,024 tokens shifts between requests, the cache misses entirely and you pay full price.
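As a rough illustration, here is what that structure looks like with Anthropic's Python SDK. The model name and policy file are placeholders, not a recommendation, and the cache_control marker is what tells the API where the stable prefix ends; OpenAI does the equivalent automatically once the prefix clears its minimum.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Stable prefix: the policy document that never changes between requests.
# Placeholder path; in practice this is your 15,000-token policy.
POLICY_TEXT = open("policy.md").read()

def ask(question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-5",  # illustrative model name
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": POLICY_TEXT,
                "cache_control": {"type": "ephemeral"},  # marks the cache boundary
            }
        ],
        # Variable part goes last: just the user's question, nothing else.
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text
```

Everything above the cache_control marker must be byte-identical from one request to the next, which is why the question, and anything else that varies, lives in the messages list rather than the system block.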

Why does it matter for your business?

For any owner running an AI feature on a long, stable prompt at any volume, prompt caching is the largest single cost lever available in 2026. The arithmetic is unforgiving. A 12,000-token policy document used 200 times in a 5-minute window costs about £7 in input tokens at standard Claude Sonnet rates. With caching enabled, the same workload costs about 75p. Roughly a 10x reduction on the cached portion.
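For anyone who wants to check the arithmetic, a back-of-envelope version looks like this. The rates are illustrative placeholders, not a quote; substitute your vendor's current per-million-token price, write premium, and read discount.

```python
# Back-of-envelope: a 12,000-token prefix used 200 times inside one cache window.
# All rates are illustrative placeholders; check your vendor's pricing page.
PREFIX_TOKENS = 12_000
CALLS = 200
INPUT_RATE = 3.00 / 1_000_000   # price per input token (example: 3 currency units per million)
WRITE_PREMIUM = 1.25            # the first call writes the cache at a premium
READ_DISCOUNT = 0.10            # later calls read it at roughly 10% of the input rate

uncached = PREFIX_TOKENS * CALLS * INPUT_RATE
cached = PREFIX_TOKENS * INPUT_RATE * (WRITE_PREMIUM + READ_DISCOUNT * (CALLS - 1))

print(f"uncached prefix cost: {uncached:.2f}")  # ~7.20
print(f"cached prefix cost:   {cached:.2f}")    # ~0.76, roughly a 10x reduction
```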

The savings scale with three things: prompt length, repetition rate, and how tightly your queries cluster in time. Project Discovery, a security firm, reported a 59% reduction in LLM spend after implementing caching, rising to 66% once they tuned prompt structure for cache stability. That is real money for a system already in production.

There is a second benefit that often matters more than cost for chat-style applications. Cached reads also reduce time-to-first-token, the delay before the model starts replying, by 13 to 31% in independent measurements. For a customer-facing assistant, that is the difference between a 3-second wait and a snappy response.

The owner who does not know caching exists pays the full bill. The owner who knows but treats it as a setting to flip discovers, on a slow day when queries are sparse, that the cache expires between calls and the discount evaporates.

Where will you actually meet it?

You will meet prompt caching in three places, and they look different from each other. The first is your vendor’s pricing page or usage dashboard. Anthropic, OpenAI, Google, and AWS Bedrock all expose cache-write tokens and cache-read tokens as separate line items, billed at different rates. If your dashboard shows zero cached tokens after a few weeks of production use, your prompts are either too short, too variable, or structured wrong.

The second is the API itself. OpenAI applies caching automatically once your prompt clears 1,024 tokens. Anthropic requires you to mark the cache boundary explicitly with a cache_control parameter on the cached block. Google Gemini offers both implicit and explicit caching, with hourly storage pricing on the explicit version for long-running workflows. The mechanics differ, and a vendor that cannot tell you their minimum, their TTL, their write premium, and their read discount is not yet giving you the information you need.
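If you want to confirm the cache is actually firing rather than trusting the dashboard, both major APIs report it in the usage object on every response. A minimal sketch, assuming the official Python SDKs and requests you have already made:

```python
# Anthropic: cache activity appears as separate usage fields on each response.
# Assumes `response` came from client.messages.create with a cache_control block.
print(response.usage.cache_creation_input_tokens)  # tokens written to the cache
print(response.usage.cache_read_input_tokens)      # tokens served from the cache

# OpenAI: caching is automatic above the minimum; cached tokens are reported
# inside prompt_tokens_details on a chat completion response.
print(completion.usage.prompt_tokens_details.cached_tokens)
```

If the read figures stay at zero across a busy afternoon, the prefix is changing between requests somewhere, which is the next point.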

The third place is the design of your application. Caching only works if the unchanging part of the prompt actually stays unchanged. A timestamp in the system prompt, a user ID near the top, a tool definition reordered between requests, any of these break the cache silently. Building applications that maintain stable prefixes is a design discipline rather than a runtime tweak. Once it is in place, the savings compound; before it is in place, the feature is invisible.
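A concrete illustration of the trap, reusing the placeholder POLICY_TEXT from the earlier sketch and a hypothetical question variable: the two versions below differ only in where the timestamp lives, and that difference decides whether the prefix can ever be reused.

```python
from datetime import datetime, timezone

now = datetime.now(timezone.utc).isoformat()

# Breaks the cache: the timestamp changes on every request, so the prefix
# is never byte-identical and every call pays full price.
system_prompt_bad = f"Current time: {now}\n\n{POLICY_TEXT}"

# Cache-friendly: the prefix is only the unchanging policy; anything volatile
# (timestamp, user ID, session data) travels with the user turn instead.
system_prompt_good = POLICY_TEXT
user_message = f"[asked at {now}] {question}"
```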

When to ask about it, when to ignore it

Ask about caching when you have a long stable prefix, 1,024 tokens or more, used many times within a short window. Internal assistants on a policy document, support bots with fixed tone and product context, code review tools that resend the same repository context, all qualify. In those workloads, caching is the closest thing to free money the vendor offers, because it directly reduces their compute costs as well as your bill.

Ignore caching when any of three conditions are true. If your prompts are under the vendor minimum, typically 1,024 tokens, the feature simply does not activate. If your context changes every request, fresh document per query, no overlap, there is nothing to cache. And if your traffic is sparse, a few queries spread across the day, the 5-minute TTL expires between calls and you pay the write premium without recovering it in reads. A solo consultant making three or four queries a day will spend more with caching enabled than without.
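The break-even logic behind that last point is simple enough to sketch. With an illustrative 25% write premium and 90% read discount, a single cache hit inside the TTL already recovers the write; zero hits means every call costs a quarter more than it needed to.

```python
# How many cache reads are needed to recover the write premium?
# Rates are illustrative; substitute your vendor's actual numbers.
WRITE_PREMIUM = 0.25  # extra cost of a cache write versus a normal input token
READ_SAVING = 0.90    # saving on each cache read versus a normal input token

for reads_within_ttl in range(0, 4):
    extra = WRITE_PREMIUM - READ_SAVING * reads_within_ttl
    verdict = "loses money" if extra > 0 else "saves money"
    print(f"{reads_within_ttl} reads inside the TTL: {verdict} ({-extra:+.2f}x per prefix token)")
```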

The question to put to any vendor pitching an AI feature is direct: what is your minimum cache size, what is the TTL, what is the write premium, what is the read discount, and what causes a cache miss? If the answers are vague, you are not making an informed cost decision yet.

Tokens are the units AI vendors charge by, roughly three-quarters of a word in English. A prompt is just a sequence of tokens, and caching is a discount on a portion of that sequence. The post on what is a token covers how the count actually works.

Input and output tokens are billed separately, and input usually dominates the bill on long-prompt workloads. That is why caching, which only discounts input, can move the headline cost so dramatically. Output tokens are not affected by caching at all.

Context window is the maximum number of tokens the model can process in a single request, currently 200,000 on Claude and around 1 million on Gemini Pro. Caching does not extend the window, it just reduces the cost of using a large portion of it repeatedly.

Batch API is a separate cost lever, available on OpenAI and Anthropic, that processes requests asynchronously at a flat 50% discount on both input and output, with a 24-hour turnaround. It is the right answer for non-urgent work like nightly report generation. Caching is for interactive systems where latency matters. Both can stack, though the marginal gain of layering them is small.

Fine-tuning is the deeper, slower customisation lever, retraining the model itself on your examples. It is more expensive and slower to update than caching and only earns its keep at high volume. The post on fine-tuning covers the trade-off in more detail.

If your AI bill is creeping up and you have a long stable prompt anywhere in the stack, prompt caching is the first place to look. If it is not, it is not.

Sources

Anthropic (2026). Prompt caching documentation including cache_control mechanics, write premium, read discount, and TTL options. https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching

Anthropic (2026). API pricing page with cache-write and cache-read line items per model. https://www.anthropic.com/api/pricing

OpenAI (2026). Prompt caching guide covering automatic caching mechanics and minimum thresholds. https://platform.openai.com/docs/guides/prompt-caching

Google (2026). Gemini context caching documentation, both implicit and explicit caching. https://ai.google.dev/gemini-api/docs/caching

AWS (2026). Bedrock prompt caching documentation with cost-and-usage breakdown for cache reads and writes. https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-caching.html

Project Discovery (2024). How we cut LLM cost with prompt caching, real-world before-and-after numbers showing a 59% reduction. https://projectdiscovery.io/blog/how-we-cut-llm-cost-with-prompt-caching

SSW (2025). Best practices for prompt caching including prefix stability and cache invalidation behaviour. https://www.ssw.com.au/rules/do-you-follow-best-practices-for-prompt-caching

Tian Pan (2026). Prompt cache break-even math, working out how many reads are needed to recover the write premium. https://tianpan.co/blog/2026-04-17-prompt-cache-break-even-math

Sebastian Raschka (2024). Coding the KV cache in LLMs, the technical underpinning that prompt caching productises. https://magazine.sebastianraschka.com/p/coding-the-kv-cache-in-llms

Frequently asked questions

How is prompt caching different from RAG?

RAG retrieves relevant documents at query time and passes them into the prompt. Prompt caching stores the model's processed view of the unchanging prefix of any prompt, retrieved or otherwise. The two are complementary. A RAG pipeline can use caching for the stable system prompt and tool definitions while the retrieved documents change per query.

Will prompt caching change the answers my AI gives?

No. Caching only stores the model's intermediate calculations on the unchanging prefix of the prompt. The model still generates fresh output for every request, conditioned on the full prompt including the user's specific question. You get the same answers, just faster and at lower cost on the cached portion. Caching does not alter the computation itself, so a cached run behaves exactly as an uncached run would with the same inputs.

What happens to my data inside a cached prefix?

The cached state lives on the vendor's infrastructure for the duration of the TTL, default 5 minutes, optionally up to 1 hour or longer. It is keyed by a hash of your prompt and not shared across organisations. For regulated data, treat extended retention as a compliance question and review your vendor's data residency policy before enabling it.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
