13  Prompt Caching and Prefix Stability

Learning Objective

After this chapter, you should be able to identify prompt-layout choices that improve or break prefix cacheability.

Why This Matters

Provider documentation describes prompt caching as a latency and cost optimization when requests share prompt prefixes. OpenAI’s prompt caching guide states that cache hits require exact prefix matches and recommends placing static content before variable content (OpenAI 2025). vLLM documents automatic prefix caching as reuse of KV cache when a new request shares an existing prefix (vLLM 2025).

Core Concept

Cache-friendly prompt layout usually looks like:

stable instructions
tool schemas
static examples
task
user-specific state
retrieval
tool observations

Cache-hostile layout puts timestamps, UUIDs, randomized JSON, retrieval results, or tool observations before stable content.

Segment-Level Thinking

Prompt caching is easier to reason about when prompts are assembled from named segments rather than string concatenation. A segment might be system, tool_schemas, task, memory, retrieval, or tool_result. Once the segments are named, the runtime can ask:

  • Which segments are stable across requests?
  • Which segments are user-specific?
  • Which segments change every run?
  • Which segments are large?
  • Which segments must appear before model reasoning?

This is not only a performance concern. Segment naming also helps state discipline. The same structure that improves cache analysis makes prompt-injection review and context budgeting easier.
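A minimal sketch of what named segments might look like in code. The Segment type, the stability labels, and the stable_prefix helper are hypothetical names chosen for illustration; they are not taken from the lab package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    name: str       # e.g. "system", "tool_schemas", "task", "retrieval"
    text: str
    stability: str  # hypothetical labels: "stable", "user", or "volatile"


def stable_prefix(segments):
    """Return the leading run of segments labeled stable."""
    prefix = []
    for seg in segments:
        if seg.stability != "stable":
            break
        prefix.append(seg)
    return prefix


layout = [
    Segment("system", "You are a repo triage assistant.", "stable"),
    Segment("tool_schemas", '{"name": "search"}', "stable"),
    Segment("task", "Summarize the failing test.", "volatile"),
    Segment("retrieval", "README excerpt", "volatile"),
]

print([seg.name for seg in stable_prefix(layout)])  # ['system', 'tool_schemas']
```

Once segments carry labels like these, the questions above become queries over a list rather than a reading exercise over one long string.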

Cache Breaker Examples

The local demo flags a timestamp before the static prefix. Other common breakers include:

  • UUIDs before tool schemas,
  • randomized JSON key ordering,
  • dynamic retrieval before stable instructions,
  • user profile state before reusable system text,
  • tool output mixed into reusable examples.

The fix is usually architectural: assemble stable text first, serialize deterministically, and push volatile content later.
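One way to name breakers like these, sketched under assumptions: the regular expressions, the breaker labels, and the find_breakers function are illustrative and are not the lab's implementation. The sketch only recognizes ISO-style timestamps and canonical UUIDs.

```python
import re
from typing import NamedTuple


class Segment(NamedTuple):
    name: str
    text: str
    stable: bool


# Illustrative patterns only: ISO-8601-style timestamps and canonical UUIDs.
TIMESTAMP_RE = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def find_breakers(segments):
    """Name volatile content at or before the last stable segment."""
    stable_positions = [i for i, seg in enumerate(segments) if seg.stable]
    if not stable_positions:
        return []
    breakers = []
    for seg in segments[: stable_positions[-1] + 1]:
        if TIMESTAMP_RE.search(seg.text):
            breakers.append(("timestamp", seg.name))
        if UUID_RE.search(seg.text):
            breakers.append(("uuid", seg.name))
        if not seg.stable:
            breakers.append(("dynamic_before_stable", seg.name))
    return breakers


layout = [
    Segment("request_id", "req 123e4567-e89b-12d3-a456-426614174000", False),
    Segment("system", "You triage repository issues.", True),
    Segment("tool_schemas", '{"name":"search"}', True),
    Segment("task", "Summarize the failing test.", False),
]

print(find_breakers(layout))
# [('uuid', 'request_id'), ('dynamic_before_stable', 'request_id')]
```

Pattern matching like this will miss breakers it has no pattern for, so it complements a first-changed-segment comparison rather than replacing it.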

Cacheability vs Personalization

Personalization and cacheability often pull in opposite directions. User-specific memory, account state, and retrieved documents are dynamic. Stable instructions and tool schemas are reusable. A good prompt layout isolates those regions.

This matters for agentic systems because tool schemas can be long and stable. If they appear after dynamic content, the system may waste an optimization opportunity. If they appear before dynamic content, they can remain in the stable prefix when provider behavior supports that.

Case Study Step

The repo triage prompt would likely have stable instructions and stable tool schemas. The repository content, user request, and trace-derived observations are dynamic. The cache demo teaches the layout principle before the book introduces any provider-specific integration.

Deterministic Serialization

Prompt caches are sensitive to exact text. That makes serialization choices important. JSON key ordering, whitespace, generated IDs, timestamps, and nondeterministic collection ordering can all alter the prefix.

The local cache demo uses deterministic JSON serialization for tool schema content. That is a tiny implementation detail with broad implications. If your runtime serializes tool schemas from an unordered source, the prompt may look semantically equivalent while failing to share a prefix.

The design rule is simple: make stable content stable in bytes, not just stable in meaning.
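As a small illustration using Python's standard json module: two dictionaries that are equal in meaning can serialize to different bytes unless key order and separators are pinned. The serialize_schema helper and the example schema are hypothetical.

```python
import json


def serialize_schema(schema: dict) -> str:
    """Serialize with sorted keys and fixed separators so equal dicts give equal text."""
    return json.dumps(schema, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


# Same content, different insertion order (as might come from an unordered source).
a = {"name": "search", "parameters": {"query": "string", "limit": "integer"}}
b = {"parameters": {"limit": "integer", "query": "string"}, "name": "search"}

assert a == b                                      # semantically equal
assert json.dumps(a) != json.dumps(b)              # default output follows insertion order
assert serialize_schema(a) == serialize_schema(b)  # byte-stable after sorting
```

Sorting keys handles dictionary ordering only. Lists, floating-point formatting, and embedded IDs still need their own conventions if they appear in stable content.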

Cache Metrics

A production cache analysis should track:

  • stable prefix tokens,
  • dynamic tokens,
  • first changed segment,
  • cache-hit rate where provider data is available,
  • latency difference between cache hit and miss,
  • cost difference where pricing applies.

The local demo reports only static analysis. It does not observe provider cache hits. That limitation is explicit because provider-specific behavior must be measured or cited.
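Where a provider or runtime does report cache reuse, the metrics above could be aggregated along these lines. This is a sketch: the RequestRecord fields are hypothetical, and the sample values are synthetic numbers chosen to make the arithmetic easy to follow, not measurements.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestRecord:
    prompt_tokens: int
    cached_tokens: int  # hypothetical field; 0 means no reuse was reported
    latency_ms: float


def summarize(records):
    """Aggregate hit rate, cached-token ratio, and mean latency for hits and misses."""
    hits = [r.latency_ms for r in records if r.cached_tokens > 0]
    misses = [r.latency_ms for r in records if r.cached_tokens == 0]
    total_prompt = sum(r.prompt_tokens for r in records)
    total_cached = sum(r.cached_tokens for r in records)
    return {
        "hit_rate": len(hits) / len(records) if records else 0.0,
        "cached_token_ratio": total_cached / total_prompt if total_prompt else 0.0,
        "mean_latency_hit_ms": mean(hits) if hits else None,
        "mean_latency_miss_ms": mean(misses) if misses else None,
    }


# Synthetic records for illustration only.
records = [
    RequestRecord(1000, 0, 900.0),
    RequestRecord(1000, 800, 500.0),
    RequestRecord(1000, 800, 300.0),
    RequestRecord(1000, 0, 700.0),
]
summary = summarize(records)
print(summary["hit_rate"], summary["cached_token_ratio"])  # 0.5 0.4
```

Cost difference is left out on purpose: it depends on pricing that has to come from the provider.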

Prompt Layout Review

A useful prompt-layout review asks for the assembled prompt with segment labels. Reviewers should be able to see which text is stable, which text is dynamic, and which text is untrusted. If the team cannot show the assembled prompt, cache analysis becomes guesswork.

This review also catches safety problems. Untrusted content that appears before policy reminders or tool schemas is not only a cache issue; it may also be an instruction/data separation issue.

Staff Practice Notes

Prompt caching is an optimization, but prompt assembly is architecture. The same refactor that improves prefix stability can also change instruction order, trust labeling, and schema visibility. Review cache changes with correctness evals, not only with token ratios.

Ask whether the prompt is byte-stable where it claims to be stable. Developers often reason semantically: “the instructions are the same.” Cache systems usually care about exact prefixes under provider-specific rules. Stable serialization, sorted schemas, and volatile metadata placement are engineering details, not prompt-writing trivia.

Operational Invariants

Stable prompt segments should be serialized deterministically. Instructions, tool schemas, and output schemas should not change because of dictionary ordering, timestamps, UUIDs, or formatting churn. If they change, the runtime should know why.

Dynamic prompt segments should appear after stable segments when cacheability matters. User request, retrieval output, tool observations, memories, and timestamps are often necessary, but they should not break the reusable prefix unless the product intentionally trades cacheability for another requirement.

Cache-oriented refactors should be guarded by correctness evals. Moving content around can change instruction priority or trust labeling. A better cache ratio is not a win if the prompt becomes less safe or less accurate.

The Lab

python -m agentic_systems_lab.context --cache-demo

Reading the Lab Output

The cache demo reports first_changed_segment, shared_prefix_tokens, and cache_breakers. If first_changed_segment is timestamp, the prompt layout has placed volatile content before reusable content. That is the simplest possible cache-breaker example.

The recommendation list should be read as prompt-layout guidance, not as a provider guarantee.

Inspect the first changed segment before looking at aggregate token counts. The first changed segment tells you where the stable prefix broke. If that segment is a timestamp, UUID, or randomized schema, the fix is likely prompt assembly. If it is the user request, the layout may already be reasonable.

Code Walkthrough

analyze_prompt_cache_layouts compares two segment lists. It reports candidate tokens, shared prefix tokens, first changed segment, dynamic tokens, cache breakers, and recommendations.

The function treats prompt layout as structured data. Each segment has a name, text, and cacheability label. The comparison walks the layouts until a segment changes, then estimates how much prefix remains reusable under the local approximation. This is intentionally simpler than provider internals, but it captures the engineering habit.
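The walk described above can be sketched as follows. This is not the source of analyze_prompt_cache_layouts; the compare_layouts name, the whitespace token estimate, and the example segments are assumptions made for illustration, and the comparison is segment-granular where a real cache would match on tokens.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    name: str
    text: str


def approx_tokens(text: str) -> int:
    """Crude local estimate: whitespace-separated words, not a real tokenizer."""
    return len(text.split())


def compare_layouts(a, b):
    """Walk two segment lists in order and stop at the first segment that differs."""
    shared = 0
    first_changed = None
    for seg_a, seg_b in zip(a, b):
        if seg_a != seg_b:
            first_changed = seg_a.name
            break
        shared += approx_tokens(seg_a.text)
    total = sum(approx_tokens(seg.text) for seg in a)
    return {
        "first_changed_segment": first_changed,
        "shared_prefix_tokens": shared,
        "dynamic_tokens": total - shared,
    }


stable = [
    Segment("system", "You triage repository issues."),
    Segment("tool_schemas", '{"name":"search"}'),
]
request_1 = stable + [Segment("task", "Summarize the failing test.")]
request_2 = stable + [Segment("task", "List open pull requests.")]

print(compare_layouts(request_1, request_2))
# {'first_changed_segment': 'task', 'shared_prefix_tokens': 5, 'dynamic_tokens': 4}
```

For brevity the sketch compares only the overlapping positions of the two lists, so a layout that merely appends a segment reports no changed segment.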

Cache breakers are concrete: timestamp before stable instructions, UUID before tool schemas, randomized JSON, retrieval before stable prefix, or dynamic metadata in the wrong place. Naming the breaker is more useful than only reporting a low ratio because it tells the prompt assembler what to fix.

The recommendation output should remain conservative. The local demo can say “move dynamic content later” or “serialize schemas deterministically.” It should not claim a provider-specific latency or cost reduction unless that claim is sourced or measured.

Expected Output

The demo identifies timestamp as the first changed segment and reports timestamp_before_static_prefix as a cache breaker.

The expected output proves the analyzer can identify a local layout defect. It does not prove provider-side cache behavior or savings. The correct next step is prompt-assembly discipline: move volatile metadata after stable instructions and rerun correctness checks.

Failure Mode

If a timestamp appears before stable instructions and schemas, the prefix changes on every request and nothing after the timestamp can be reused as a shared prefix. That makes caching less likely even though the semantic prompt looks similar.

The symptom is an optimization that disappears in production. Prompt templates appear unchanged to humans, but byte-stable prefixes are broken by timestamps, UUIDs, random JSON key order, retrieval snippets, or user-specific metadata placed too early. The model still receives a reasonable prompt, but the runtime loses a caching opportunity.

The root cause is treating prompt assembly as prose rather than serialization. Cache-sensitive prompt layout has invariants: stable instructions first, deterministic schemas, dynamic observations later, and no random fields before the stable prefix. These are software properties and should be tested like serializer behavior.

The artifact that exposes the failure is a cache-demo comparison or prompt-assembler test. The analysis should identify the first changed segment, estimate cacheable and dynamic tokens, and name cache breakers. It should also state that cacheability is an optimization claim, not a correctness guarantee.

Production Translation

Prompt layout is infrastructure design. In a production agent, stable instructions, tool schemas, and structured output schemas should be kept stable and early when provider caching behavior makes that useful. Provider-specific claims must be checked against current provider docs.

Do not optimize blindly. A cache-friendly layout that confuses instruction priority is not a win. Correctness comes first, then observability, then optimization. The local cache demo is a diagnostic tool, not a promise of provider savings.

For rollout, prompt layout changes should be reviewed like code changes. They can affect instruction priority, context labeling, schema stability, and optimization behavior at the same time. A good change includes prompt-assembler tests, eval results, and a cache-layout diff. The team should know whether it changed semantics, changed only serialization, or changed only dynamic content placement.

Design Review Questions

For prompt caching, ask:

  • Which prompt segments are byte-stable?
  • Which segments are dynamic?
  • Where do timestamps and IDs appear?
  • Are tool schemas serialized deterministically?
  • What is the first changed segment across similar requests?
  • Is cache behavior documented by the provider or measured locally?
  • What correctness risk would a layout change introduce?

Cacheability should be reviewed alongside instruction hierarchy and untrusted-content placement.

Review Rubric

Reject cache optimizations that reorder instructions or trust boundaries without eval coverage. Cacheability is not a correctness argument.

Require review when prompt layouts are cache-aware but not tested as serializers. Stable prefixes should be a software invariant, not a manual hope.

Accept the design when stable and dynamic segments are labeled, schemas serialize deterministically, first-changed-segment analysis exists, and provider-specific claims are sourced or measured.

Implementation Notes

The cache demo can evolve into a prompt assembler test. Given a task, the assembler would return labeled segments. Tests could assert that stable instructions and tool schemas appear before dynamic observations, that JSON serialization is sorted, and that timestamps do not appear in the stable prefix.

This turns prompt optimization into testable software behavior.
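A sketch of what such tests might look like. The assemble_segments function, its segment names, and the tool schema are hypothetical stand-ins for a real assembler; the tests are plain functions that a runner such as pytest could collect.

```python
import json
from typing import NamedTuple


class Segment(NamedTuple):
    name: str
    text: str
    stable: bool


def assemble_segments(task: str, timestamp: str):
    """Hypothetical assembler: stable segments first, volatile segments last."""
    tools = {"search": {"description": "Search the repo", "args": ["query"]}}
    return [
        Segment("system", "You triage repository issues.", True),
        Segment("tool_schemas", json.dumps(tools, sort_keys=True), True),
        Segment("task", task, False),
        Segment("timestamp", timestamp, False),
    ]


def stable_text(segments) -> str:
    return "".join(seg.text for seg in segments if seg.stable)


def test_stable_segments_precede_dynamic():
    flags = [seg.stable for seg in assemble_segments("Task A", "2025-01-01T00:00:00Z")]
    # Every True must come before every False.
    assert flags == sorted(flags, reverse=True)


def test_stable_prefix_is_byte_identical_across_requests():
    first = assemble_segments("Task A", "2025-01-01T00:00:00Z")
    second = assemble_segments("Task B", "2025-06-01T12:30:00Z")
    assert stable_text(first) == stable_text(second)


def test_timestamp_not_in_stable_prefix():
    timestamp = "2025-01-01T00:00:00Z"
    assert timestamp not in stable_text(assemble_segments("Task A", timestamp))
```

These tests check layout and byte stability only. They say nothing about whether the reordered prompt still behaves correctly, which remains a job for evals.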

Extension Path

Turn the cache demo into a prompt-assembler test. The assembler should return labeled segments in order. Tests should assert that stable instructions and schemas come first, dynamic observations come later, and volatile metadata does not appear before the stable prefix.

Then add a report summary for cacheable-token ratio and first changed segment. Keep the language cautious: the report can identify local layout risk, while provider-specific savings require provider documentation or measurement.

Worked Scenario: Timestamp First

The cache demo puts a timestamp before stable instructions. That one choice makes the first segment differ across requests. Even if the rest of the prompt is identical, the shared prefix is lost in the local analysis.

The fix is not clever prompt wording. It is prompt assembly discipline. Put stable text first. Move timestamps later. Sort schemas. Keep dynamic retrieval out of the prefix. The optimization begins as software architecture, not prose.
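The effect can be seen with a character-level comparison of two assembled prompts. This is a local approximation with made-up prompt text; provider caches work on tokens under their own rules, so the numbers only show the direction of the effect.

```python
import os

# Hypothetical stable text shared by every request.
STABLE = "You triage repository issues.\nTools: search, read_file\n"


def timestamp_first(ts: str, task: str) -> str:
    return f"Time: {ts}\n{STABLE}Task: {task}\n"


def timestamp_last(ts: str, task: str) -> str:
    return f"{STABLE}Task: {task}\nTime: {ts}\n"


def shared_prefix_chars(a: str, b: str) -> int:
    """Length of the longest common leading substring of two prompts."""
    return len(os.path.commonprefix([a, b]))


ts_1, ts_2 = "2025-01-01T09:00:00Z", "2025-01-01T09:00:05Z"
task = "Summarize the failing test."

early = shared_prefix_chars(timestamp_first(ts_1, task), timestamp_first(ts_2, task))
late = shared_prefix_chars(timestamp_last(ts_1, task), timestamp_last(ts_2, task))

# With the timestamp first, the shared prefix ends inside the timestamp itself.
# With the timestamp last, the whole stable block (and here the task) is shared.
assert early < len(STABLE) < late
print(early, late)
```

The same stable text is present in both layouts; only its position relative to the volatile field changes how much of it is shared.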

Chapter Synthesis

Prompt caching turns prompt assembly into an engineering surface. Stable prefixes, deterministic schemas, and dynamic suffixes are not merely prompt-writing preferences. They are serialization properties that can be tested and reviewed.

The chapter keeps the claim local: the demo identifies cache breakers; it does not promise provider savings. That discipline is the evidence policy in action. Optimization claims should be sourced, measured, or narrowed to what the local artifact proves.

Evidence and References

OpenAI and vLLM documentation support the prefix-match framing (OpenAI 2025; vLLM 2025). The local demo supports the cache-breaker analysis.

Takeaways

  • Cache-friendly prompts require stable serialization, not just similar wording.
  • Dynamic metadata belongs after stable instructions and schemas when cacheability matters.
  • Cache optimization should be guarded by correctness and safety evals.

Exercises

  1. Add a UUID before tool schemas and inspect the demo. Confirm that the first changed segment moves earlier and cacheable tokens fall.
  2. Compare deterministic and randomized JSON. Use sorted keys for one layout and randomized key order for another.
  3. Move retrieval after stable instructions. Explain why the change improves cacheability, and state which eval would confirm that correctness is unchanged.
  4. Add a report field for cacheable token ratio. Define the threshold that should trigger review.
  5. Write a prompt assembler test that asserts stable instructions and tool schemas precede dynamic observations.
  6. Add a timestamp to the stable prefix and make the test fail. Then move it to the dynamic suffix and make the test pass.
  7. Document which provider or runtime behavior is assumed by your cache analysis and which claims are only local approximations.
  8. Design a cache-safety review: what prompt changes could alter behavior while improving cacheability, and how should evals catch them?

Checklist

  • Stable prefixes are an optimization surface.
  • Dynamic content belongs later.
  • Never depend on cache behavior for correctness.
  • Prompt assembly should be tested like any other serializer.
  • Tool schemas should be serialized deterministically.
  • Cache analysis needs byte-stability, not semantic similarity.
  • Optimization claims should cite provider docs or local measurements.
  • Correctness evals must guard cache-driven prompt refactors.