Prompt Caching

Prompt caching is an inference optimization that reuses computation for a repeated prefix of model input. When multiple requests begin with the same system instructions, examples, documents, or tool definitions, the provider can cache the intermediate attention state for that prefix instead of recomputing it from the first token each time.

Caching can reduce latency and input-processing cost for applications with long, stable prompts. Typical candidates include:

Prompt caching normally requires exact or near-exact prefix matching. Stable content should therefore appear before request-specific data. Reordering tools, changing whitespace, inserting timestamps, or modifying early instructions may invalidate the cached prefix even if most content is unchanged.

Prompt caching does not increase the context window or improve answer quality directly. The model still receives the same input; the optimization changes how efficiently that input is processed. It also differs from agent memory, which stores information for later retrieval rather than cached model computation.

Providers may use different cache lifetimes, minimum prefix lengths, write costs, and accounting rules. Applications should monitor cache-hit rates and effective cost rather than assuming that a long static prompt is being reused.

Sensitive data remains subject to the provider's storage and retention controls. Developers should verify how cached prefixes are isolated between organizations and whether caching is compatible with their data-governance requirements.

See the Anthropic prompt caching documentation for an implementation based on explicit cache breakpoints.

The LLM Knowledge Base is a collection of bite-sized explanations for commonly used terms and abbreviations related to Large Language Models and Generative AI.

It's an educational resource that helps you stay up-to-date with the latest developments in AI research and its applications.

Promptmetheus © 2023-present