What Is Prompt Caching? How Reused LLM Context Saves Time and Cost

Last updated 2026-06-21 · ByteCosts

Direct answer

Prompt caching is an inference optimization that reuses eligible processing for a repeated prompt prefix across requests. It can reduce latency and input cost when supported, but it is not response caching, semantic caching, or permanent model memory. Savings depend on the repeated token volume, cache-write and cache-read prices, expiration rules, and measured hit behavior.

Summary

Long prompts often contain a stable prefix followed by a smaller variable suffix. Examples include:

- A system instruction shared by every request
- Large tool or function definitions
- A fixed policy document
- A codebase snapshot reused across questions
- A long reference document queried repeatedly
- Few-shot examples that remain unchanged
- A conversation history prefix with new turns appended

Without reuse, the model processes the repeated prefix again for each request. Prompt caching lets a provider or serving system reuse eligible intermediate work associated with that prefix under its documented rules.

The request still goes through inference. The model still processes uncached content and generates a new response. Caching changes only the repeated-prefix portion of the workload.

Prompt caching is not response caching

A response cache stores and returns a previous answer for a matching request. Prompt caching does not simply replay an old answer. It reuses prompt-processing work while allowing the model to generate a new output.

Semantic caching is another separate technique. It searches for a sufficiently similar earlier query and may reuse its answer. That introduces application-level similarity thresholds and freshness risks.

Prompt caching also is not durable user memory. Cached computation can expire, be evicted, or become ineligible. It does not update model weights or guarantee that a fact remains available in later requests.

Why prefix order matters

Provider implementations commonly match reusable content from the beginning of a prompt. Stable content should therefore appear before highly variable content when the API's rules support that layout.

1. Stable system instructions
2. Stable tool definitions
3. Stable examples or documents
4. Conversation history
5. Current user-specific content
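The ordering above can be sketched as a small assembly helper. This is only an illustration: `PromptParts` and `assemble` are hypothetical names, the segments are plain strings rather than any provider's message schema, and whether a given API supports this layout depends on its documented rules.

```python
from dataclasses import dataclass, field


@dataclass
class PromptParts:
    """Hypothetical container; field names are illustrative, not a provider schema."""
    system: str
    tools: list[str] = field(default_factory=list)
    documents: list[str] = field(default_factory=list)
    history: list[str] = field(default_factory=list)
    user_input: str = ""


def assemble(parts: PromptParts) -> list[str]:
    """Return prompt segments ordered from most stable to most variable."""
    return [
        parts.system,
        *parts.tools,
        *parts.documents,
        *parts.history,
        parts.user_input,
    ]


# Two requests that differ only in the final, user-specific segment.
a = assemble(PromptParts("You are a support agent.", ["tool: lookup_order"],
                         ["Refund policy v3"], [], "Where is order 17?"))
b = assemble(PromptParts("You are a support agent.", ["tool: lookup_order"],
                         ["Refund policy v3"], [], "Cancel order 22."))

# The leading segments are identical, so they form a candidate reusable prefix.
shared_prefix = a[:3]
```

Keeping the variable segment last means the two requests share every leading segment, which is the property prefix matching depends on.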

Changing bytes, tokenization, model, parameters, or cache-control boundaries can affect eligibility according to the provider. “Looks the same to a human” is not a sufficient test. Use provider-reported cache usage.

Do not restructure a prompt solely for caching if the change harms instruction clarity or security. Correctness comes first.
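Because two prefixes can look identical while differing in bytes, a local fingerprint can catch accidental drift before a request is sent. The sketch below is an assumption-laden illustration: `prefix_fingerprint` is a hypothetical helper, and a matching hash only shows that the rendered text is byte-identical. It says nothing about whether the provider will report a cache hit.

```python
import hashlib


def prefix_fingerprint(prefix: str) -> str:
    """Return the SHA-256 hex digest of the UTF-8 bytes of a rendered prefix."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


v1 = "You are a billing assistant.\nAnswer concisely."
v2 = "You are a billing assistant.\nAnswer concisely. "  # trailing space added

# The two strings look the same to a reader but are different byte sequences.
looks_identical_but_differs = prefix_fingerprint(v1) != prefix_fingerprint(v2)
```

Logging the fingerprint alongside provider-reported cache usage makes it easier to tell a changed prefix apart from an expired or evicted cache entry.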

Cache writes, cache reads, and uncached input

Pricing and usage schemas vary, but a prompt-cache ledger may include:

- Uncached input tokens
- Cache-creation or write tokens
- Cache-read tokens
- Output tokens

The cost equation should preserve each category:

request cost = uncached input cost + cache write cost + cache read cost + output cost

A cache miss can be more expensive than a hit if the provider charges a separate write rate. Savings depend on reuse count, cache lifetime, prefix size, and the relationship between write, read, and ordinary input rates.
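The cost equation above can be written as a small function that keeps each token category separate. The rates below are placeholders chosen for illustration, not any provider's prices, and the `Rates` and `Usage` field names are hypothetical rather than a real usage schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Price per token for each category (placeholder values, not real prices)."""
    input_rate: float
    cache_write_rate: float
    cache_read_rate: float
    output_rate: float


@dataclass(frozen=True)
class Usage:
    """Token counts for one request, split by ledger category."""
    uncached_input: int
    cache_write: int
    cache_read: int
    output: int


def request_cost(u: Usage, r: Rates) -> float:
    """request cost = uncached input + cache write + cache read + output."""
    return (
        u.uncached_input * r.input_rate
        + u.cache_write * r.cache_write_rate
        + u.cache_read * r.cache_read_rate
        + u.output * r.output_rate
    )


rates = Rates(input_rate=1.0e-6, cache_write_rate=1.25e-6,
              cache_read_rate=0.1e-6, output_rate=4.0e-6)

# Same 10,000-token prefix: first written to the cache, later read from it.
miss = Usage(uncached_input=200, cache_write=10_000, cache_read=0, output=300)
hit = Usage(uncached_input=200, cache_write=0, cache_read=10_000, output=300)

miss_cost = request_cost(miss, rates)
hit_cost = request_cost(hit, rates)
```

With a write rate above the ordinary input rate, as assumed here, the miss costs more than the hit and more than an uncached request of the same size would.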

The existing provider prompt-caching pricing analysis focuses on commercial differences. This page defines the mechanism.

The break-even logic

Use variables rather than a fixed provider rate:

- T = reusable prefix tokens
- N = total requests using that prefix
- P_input = ordinary input price per token
- P_write = cache-write price per token
- P_read = cache-read price per token

With one successful write and N - 1 successful reads:

uncached prefix cost = N × T × P_input
cached prefix cost = T × P_write + (N - 1) × T × P_read

Savings require the cached version to cost less after accounting for misses, expiry, and any minimum eligible length. The prompt-cache savings calculator applies the provider-specific rates and hit assumptions.
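A minimal sketch of the break-even comparison follows, under the idealized assumption of exactly one write followed by N - 1 reads with no misses, expiry, or minimum-length effects. The function names are hypothetical and the rates are relative placeholders, not provider prices. Rearranging "cached cost < uncached cost" gives N > (P_write - P_read) / (P_input - P_read), which only has a solution when reads are cheaper than ordinary input.

```python
import math
from typing import Optional


def uncached_cost(t: int, n: int, p_input: float) -> float:
    """Prefix cost when every one of n requests pays the ordinary input rate."""
    return n * t * p_input


def cached_cost(t: int, n: int, p_write: float, p_read: float) -> float:
    """Prefix cost with one cache write followed by n - 1 cache reads."""
    return t * p_write + (n - 1) * t * p_read


def min_requests_to_save(p_input: float, p_write: float,
                         p_read: float) -> Optional[int]:
    """Smallest n where the cached prefix is strictly cheaper, or None."""
    if p_read >= p_input:
        return None  # reads are not discounted, so caching never pays off
    ratio = (p_write - p_read) / (p_input - p_read)
    return max(1, math.floor(ratio) + 1)


# Relative placeholder rates: write at 1.25x input, read at 0.1x input.
break_even_n = min_requests_to_save(p_input=1.0, p_write=1.25, p_read=0.1)
```

Under these placeholder rates the cached prefix costs more for a single request and less from the second request onward. Real workloads need the miss and expiry adjustments the paragraph above describes.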

Cache hit rate needs a precise denominator

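“Hit rate” can refer to several different quantities, so state which one is being reported:

- Requests with any cached tokens
- Eligible requests that achieved a hit
- Reusable tokens served from cache
- Dollar-weighted share of input served from cache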

For cost modeling, token-weighted reuse is often more informative than request hit rate. One small hit and one very large miss should not be treated as a 50 percent cost hit rate.
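The gap between the two denominators can be shown with the example from the paragraph above: one small hit and one very large miss. The dictionary keys are hypothetical field names, not a provider's usage schema, and the token counts are invented for illustration.

```python
requests = [
    {"cache_read_tokens": 100, "total_input_tokens": 200},    # small hit
    {"cache_read_tokens": 0, "total_input_tokens": 20_000},   # large miss
]


def request_hit_rate(rows: list[dict]) -> float:
    """Share of requests that read at least one token from cache."""
    hits = sum(1 for r in rows if r["cache_read_tokens"] > 0)
    return hits / len(rows)


def token_hit_rate(rows: list[dict]) -> float:
    """Share of all input tokens that were served from cache."""
    read = sum(r["cache_read_tokens"] for r in rows)
    total = sum(r["total_input_tokens"] for r in rows)
    return read / total


by_request = request_hit_rate(requests)
by_token = token_hit_rate(requests)
```

Here the request-level rate is one half, while less than one percent of input tokens actually came from cache.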

Fields worth logging include:

- Eligible prefix tokens
- Cache-read tokens
- Cache-write tokens
- Ordinary input tokens
- Requests with hits and misses
- Prefix identifier or template version
- Model and provider
- Cache age when available

Use non-PII identifiers for templates and workloads.
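One way to keep these fields together is a per-request record that serializes to a log line. The `CacheUsageRecord` class, its field names, and the example values are all hypothetical. Map them onto whatever usage fields your provider actually reports.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class CacheUsageRecord:
    """Hypothetical per-request log record; not a provider schema."""
    template_id: str            # non-PII identifier such as a template version
    provider: str
    model: str
    eligible_prefix_tokens: int
    cache_read_tokens: int
    cache_write_tokens: int
    ordinary_input_tokens: int
    cache_hit: bool
    cache_age_seconds: Optional[float] = None  # only when the provider exposes it


record = CacheUsageRecord(
    template_id="support-prompt-v7",
    provider="example-provider",
    model="example-model",
    eligible_prefix_tokens=10_000,
    cache_read_tokens=10_000,
    cache_write_tokens=0,
    ordinary_input_tokens=200,
    cache_hit=True,
)

# One JSON object per line is easy to aggregate later by template or model.
log_line = json.dumps(asdict(record), sort_keys=True)
```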

What breaks a cache hit

The exact rules are provider-specific, but common causes include:

- The prefix changed
- Content order changed
- A different model or endpoint was used
- The cache expired or was evicted
- The prompt did not meet a minimum length
- A request parameter changed in a way that invalidates reuse
- Traffic did not return within the supported cache lifetime
- Dynamic content was inserted too early in the prefix

Version stable prompt assets deliberately. A hidden timestamp, request ID, or user-specific value near the front can destroy reuse.
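The effect of an early dynamic value can be checked locally by measuring how much two rendered prompts have in common from the start. This sketch compares characters, which is only a rough proxy: providers match on their own units and rules. The templates and the `common_prefix_length` helper are hypothetical.

```python
def common_prefix_length(a: str, b: str) -> int:
    """Count leading characters that are identical in both strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


stable = "You are a support agent.\nPolicy: refunds within 30 days.\n"

# Timestamp placed before the stable text versus after it.
early = "Request time: {ts}\n" + stable
late = stable + "Request time: {ts}\n"

t1, t2 = "2026-06-21T10:00:00Z", "2026-06-21T10:00:05Z"

shared_early = common_prefix_length(early.format(ts=t1), early.format(ts=t2))
shared_late = common_prefix_length(late.format(ts=t1), late.format(ts=t2))
```

With the timestamp first, the two prompts diverge before the stable text begins. With it last, the whole stable block is shared.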

Security and privacy considerations

Caching does not remove the need to understand provider data handling. Review the provider's documentation for cache isolation, retention, regional behavior, and zero-data-retention compatibility. Do not infer privacy properties from the word “cache.”

At the application layer, avoid using a shared prefix identifier that exposes customer information. Ensure tenant-specific instructions and documents cannot be reused across unauthorized boundaries.
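One possible way to label prefixes without exposing customer information is a keyed hash over the tenant and template version. This is an application-level sketch only: `scoped_prefix_id` is a hypothetical helper, the secret shown is a placeholder, and an opaque identifier does not by itself enforce isolation on the provider side.

```python
import hashlib
import hmac


def scoped_prefix_id(secret: bytes, tenant_id: str, template_version: str) -> str:
    """Derive an opaque, tenant-scoped identifier using HMAC-SHA256."""
    # A unit-separator byte keeps ("ab", "c") distinct from ("a", "bc").
    message = f"{tenant_id}\x1f{template_version}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()[:16]


secret = b"placeholder-secret-load-from-a-secret-store"

id_a = scoped_prefix_id(secret, "acme-corp", "support-v7")
id_b = scoped_prefix_id(secret, "globex", "support-v7")
```

The identifier is stable for a given tenant and template, differs across tenants, and does not contain the tenant name, so it can appear in logs and dashboards.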

When prompt caching is useful

Prompt caching tends to pay off when:

- The repeated prefix is large
- Many requests reuse it
- Requests arrive within the cache lifetime
- The provider reports a meaningful cached-input discount or latency benefit
- The stable prefix can remain identical
- Quality does not require rebuilding the prefix each time

It is less useful for short, unique prompts or traffic that rarely repeats before expiry.

Frequently asked questions

Does prompt caching store the model's answer?

No. Prompt caching reuses eligible prompt-processing work. Response caching is the technique that stores and returns a previous answer.

Does a cache hit reduce output-token cost?

Normally the cached portion concerns input processing. Output is newly generated and is accounted for under the provider's output rules. Verify the exact API documentation.

Can I cache a changing conversation?

A stable prefix of a conversation may remain reusable while new turns are appended, depending on the provider's prefix-matching and cache rules. Measure reported cache usage rather than assuming a hit.

Is a high request hit rate always a large saving?

No. Savings depend on the number of tokens reused and the applicable write, read, and input rates. Token-weighted and dollar-weighted metrics are more useful than request count alone.

Cite this page

What Is Prompt Caching? How Reused LLM Context Saves Time and Cost. ByteCosts. Updated 2026-06-21. https://bytecosts.com/blog/what-is-prompt-caching/
