5  State, Memory, and Context

Learning Objective

After this chapter, you should be able to separate prompt context, tool observations, durable state, traces, eval results, and memory.

Why This Matters

Agent discussions often collapse “memory,” “state,” and “context” into one word. That is operationally dangerous. A prompt instruction, a repository file, a JSONL trace, and an eval result have different authority.

Core Concept

State Type Durable? Trusted? Goes Into Prompt? Example
User request No Partially Yes “Find the bug”
Tool output Maybe More grounded than model text Sometimes File content
Trace Yes Operational evidence Usually no JSONL event
Eval result Yes Measurement artifact No pass/fail
Summary memory Maybe Risky Sometimes compressed history
External DB Yes Depends Retrieved selectively vector index

This table is author interpretation. The repo enforces part of it: traces and eval reports are artifacts, while tool output is bounded before it can become context.

Authority Is Not the Same as Availability

Information can be available without being authoritative. A README can be read by a tool, but it is not allowed to override system policy. A model can produce a summary, but the summary is not a durable fact unless some process accepts it. A trace can show that a file was read, but it does not prove the diagnosis is correct. An eval can say a check passed, but it only proves the check it actually implements.

This distinction is central to agentic systems because the prompt often mixes many categories of text: task instructions, tool schemas, retrieved snippets, previous model outputs, summaries, and policy reminders. If the runtime does not preserve labels, the model sees a flat stream of tokens. The system owner still needs to know which tokens are instructions, which are observations, and which are untrusted content.

Context Assembly as a State Transition

Prompt context should be treated as an assembled artifact. It is not simply “the conversation.” It is the result of decisions:

  • which instructions are stable,
  • which tool schemas are included,
  • which observations are selected,
  • which memory is retrieved,
  • which content is summarized,
  • which budget is enforced.

The local ContextTracker makes a small part of that transition visible. It records segment names, token estimates, cacheability, cumulative growth, and large dynamic outputs. The estimator is crude, but the state model is useful: every added segment changes the next inference request.

Trust Labels

In an LLM-backed version, the runtime should label tool output. A file from prompt_injection_repo should be framed as untrusted repository content, not as an instruction. A trace event should be stored outside the prompt unless deliberately summarized. An eval result should normally inform deployment, not model reasoning.

Context Is a Lossy Interface

Prompt context is a lossy interface between the runtime and the model. It has limited space, weak authority labels, and model-dependent interpretation. That does not make it useless. It means the runtime should preserve stronger artifacts outside the prompt.

For example, the trace can store exact tool calls. The report can summarize context growth. The eval can record pass/fail checks. The prompt may contain only the subset needed for the next inference. This separation lets the system be debuggable even when the prompt is compressed.

Case Study Step

In the triage case study, the same file content has multiple roles. As a tool result, it is raw observation. As a trace event, its size and source become operational evidence. As a finding, a selected excerpt becomes diagnostic evidence. As an eval condition, the filename or keyword becomes a measurable requirement.

The system gets cleaner when each role is explicit. The file content does not magically become memory just because the runtime read it.

Durable State vs Runtime State

Durable state survives the run. Runtime state exists only while the run executes. Confusing them creates failure modes. If a model summary is stored as durable memory without provenance, future runs may treat it as fact. If a trace is lost after a run, incident review becomes weaker. If prompt context is the only place evidence exists, the system cannot be audited after completion.

In this repo, durable artifacts include sample reports, fixture files, tests, and committed source. Runtime artifacts include generated traces and local report outputs. The distinction is visible in .gitignore: traces are generated, while sample reports are committed. That split is a design choice. It keeps volatile run evidence out of version control while preserving stable examples.

Production systems need a more mature version of the same distinction. Some traces should be retained. Some tool outputs should be redacted. Some summaries should be durable only after validation. Some memory should expire. The exact policy depends on the domain, but the categories are general.

Context Selection Policy

Context selection should be explicit enough to test. A policy might say: include system instructions, include tool schemas, include the user task, include at most five evidence snippets, include no raw log file above a threshold, and include summaries only if they cite source IDs.

The local repo does not implement a full prompt assembler, but the ContextTracker points in that direction. It gives every segment a name and a cacheability flag. That is the minimum structure needed to ask whether context is growing safely.

Memory Is a Product Feature

Memory should not be added as a generic enhancement. It is a product feature with user expectations and failure modes. What should be remembered? Who can inspect or delete it? Can it cross projects or tenants? Does it expire? How is prompt injection removed? How does a user correct it?

For a staff-level design, require a memory specification before adding durable memory. If the team cannot answer ownership, retention, correction, and trust questions, keep memory out of the system. Use traces and reports for operational evidence instead.

The local lab does not need memory because each run is self-contained. That absence keeps the first set of boundaries easier to understand.

Staff Practice Notes

Memory is often proposed as a product feature before it is understood as a state system. Ask what should be remembered, who can see it, how it expires, how it is corrected, and whether it is evidence or preference. If those questions sound premature, the memory feature is probably being treated as magic storage for model convenience.

For context, ask the opposite question: what should not enter the prompt? Staff-level systems usually fail from over-inclusion before they fail from under-inclusion. Raw logs, stale summaries, untrusted documents, and irrelevant files all feel useful until they degrade cost, latency, grounding, or safety. Prompt context should be assembled with the same care as an API request.

Operational Invariants

State must be labeled before it is used. The runtime should know whether a piece of text is user instruction, developer instruction, current file evidence, retrieved memory, tool output, trace summary, or report prose. Labels are not bureaucracy; they determine which content can influence future actions.

State must have freshness semantics when it persists across runs. A memory record without source commit, creation time, evidence references, and invalidation policy is a liability. It may still be useful as a hint, but it should not silently override current observations.

State should have different storage destinations for different purposes. Prompt context is optimized for model behavior. Trace data is optimized for reconstruction. Reports are optimized for review. Durable memory is optimized for reuse. Collapsing those surfaces into one text blob makes the system simpler to prototype and harder to trust.

The Lab

python -m agentic_systems_lab.agent

Reading the Lab Output

The command prints the final structured result. The state lesson appears when you compare that result with the generated trace. The final JSON is the answer artifact. The trace is the runtime evidence. The files are source observations. The eval report is measurement. They are related, but they are not interchangeable.

When adapting the pattern, resist the urge to store the final answer as memory by default. First decide what should be remembered, why it should persist, and how it can be corrected.

A useful review exercise is to draw arrows between artifacts. Files feed tool output. Tool output feeds context observations. Context observations feed diagnosis. Diagnosis feeds final JSON and eval checks. Trace records the path. Report summarizes it. If any arrow is implicit, the system will be harder to debug.

Code Walkthrough

run_repo_triage records context_observation events after reading files. It does not persist those observations as memory. It writes trace events and returns a structured final result.

This distinction is subtle but important. A context_observation says that some tool output was considered during a run. It does not say the observation should be reused later, trusted as instruction, or stored as durable memory. The trace preserves evidence for review; it does not become the agent’s long-term state.

The context profiler uses a different representation: named segments with character counts, crude token estimates, and cacheability labels. Those segments are a prompt-assembly model, not a memory system. They let the reader reason about context growth without introducing persistence, retrieval, or user-specific state.

If you add memory later, do not reuse the context or trace structures blindly. Memory needs provenance, freshness, deletion, and trust labels. The current code intentionally stops before that boundary so state surfaces remain easy to distinguish.

Expected Output

The agent returns JSON with summary, files_inspected, findings, and recommended_next_step. The trace contains agent_start, policy_check, tool_call, context_observation, and agent_finish.

The important comparison is between answer and trace. The answer should be concise enough for a user. The trace should be detailed enough for a reviewer. If the answer contains claims not supported by trace events, the output is not reviewable even if it reads well.

Failure Mode

If prompt text is treated as authoritative state, a system can confuse instructions with observations. The prompt_injection_repo fixture later demonstrates why repository content should be treated as untrusted data.

The symptom is state collapse. Tool output, memory, user instruction, developer instruction, trace summary, and report text all become undifferentiated prompt material. Once this happens, the runtime cannot explain why one statement outranked another. The model may follow repository prose as instruction, prefer stale memory over current files, or cite derived summaries as if they were primary evidence.

The root cause is missing provenance. State surfaces need labels and freshness semantics. Current file content, retrieved memory, generated report summaries, and trace metadata can all be useful, but they should not have the same authority. A staff-level system should ask where each state item came from, when it was produced, whether it is trusted, and whether it belongs in prompt context.

The artifact that exposes the failure is a context or trace record with source labels. If an observation entered the prompt, the system should know whether it came from a file, memory record, tool summary, or user request. If a later answer depends on stale or untrusted content, the review should be able to identify that dependency.

Production Translation

Production systems need ownership rules for state. Durable facts should live in durable stores. Runtime evidence should live in traces. Prompt context should be assembled intentionally and capped.

Long-term memory deserves special caution. A memory store can become a persistence layer for stale summaries, injected instructions, or cross-user leakage if ownership and retrieval rules are weak. This book does not implement long-term memory; that omission is intentional. The local system first teaches state labeling and bounded context before adding a durable memory surface.

In production, state review should be part of privacy, security, and reliability review. Ask which state can cross users, repositories, tenants, or deployment stages. Ask what can be deleted, corrected, or invalidated. Ask whether a retrieved memory is evidence, instruction, or a hint. If the system cannot answer those questions, adding memory will likely increase apparent intelligence while reducing auditability.

Design Review Questions

For each state surface, ask:

  • Who owns it?
  • Is it durable?
  • Can it be corrected?
  • Can it cross users, tenants, repositories, or sessions?
  • Is it trusted as instruction, evidence, or untrusted observation?
  • Does it enter prompt context?
  • Is it stored in traces or reports?
  • What retention policy applies?

This review often reveals that a proposed “memory” feature is actually several different systems: retrieval, summarization, persistence, consent, deletion, and prompt assembly.

Review Rubric

Reject designs that collapse user instructions, tool output, memory, and trace summaries into one unlabeled prompt blob. That design cannot explain authority or provenance.

Require review when persistent memory exists without freshness, correction, deletion, or source metadata. The system may still be useful, but memory should not be trusted as current evidence.

Accept the state design when each surface has a label, owner, retention rule, prompt-entry rule, and trace evidence showing when it influenced a run.

Implementation Notes

If you add memory later, do not put it directly into prompts. Add a retrieval boundary. Add metadata. Add a trust label. Add tests showing that unrelated or malicious memory is not injected into privileged instructions. Add a deletion path. Add a trace event that records which memory entries were used.

Treat memory retrieval like a tool call. It observes a state surface and returns untrusted or partially trusted content.

Extension Path

A safe memory extension starts as read-only retrieval with provenance. Create a small memory fixture containing source commit, evidence files, summary text, creation time, and freshness policy. Then write an eval where stale memory conflicts with current repository evidence.

The desired behavior is not “never retrieve memory.” The desired behavior is “prefer current evidence when memory is stale or unsupported.” That is a precise rule the runtime can trace and the eval suite can test.

Worked Scenario: A Stale Summary

Imagine the triage bot stores a memory: “The calculator package expects division by zero to return None.” Later, the project changes its contract and now expects ZeroDivisionError. If the agent retrieves the old memory without provenance or freshness checks, it may produce the wrong diagnosis even while reading the current code.

The fix is not “never use memory.” The fix is to treat memory as a state surface with metadata. When was it written? From what evidence? Does it apply to this repository version? Can the user correct it? Should it enter prompt context, or should the runtime prefer current file content?

The local repo avoids this problem by not implementing memory. That is a conservative teaching choice. It lets the reader see state surfaces before adding a persistent one.

Chapter Synthesis

State is not one thing. Prompt context, traces, reports, durable memory, and tool output serve different purposes and should not be collapsed into a single text stream. The chapter makes that separation explicit before introducing more complicated agent behavior.

The most transferable habit is provenance. Before using a piece of state, ask where it came from, how fresh it is, whether it is trusted, and whether it belongs in the prompt. That habit is more valuable than any particular memory implementation.

Evidence and References

The state taxonomy is design judgment. The trace shape is repo evidence and aligns with tracing concepts described by OpenAI Agents SDK docs (OpenAI Agents SDK 2025).

Takeaways

  • State surfaces have different authority and should be labeled before use.
  • Memory without provenance and freshness can be worse than no memory.
  • Prompt context, traces, reports, and durable storage serve different purposes.

Exercises

  1. Identify every state surface in the repo triage agent. Include repository files, policy configuration, traces, eval outputs, reports, and process-local variables.
  2. Mark each surface as authoritative, trusted configuration, untrusted observation, derived evidence, or presentation artifact.
  3. Decide which observations would enter an LLM prompt. For each one, write the label that should accompany it in prompt context.
  4. Add a trace field that would improve state auditability. Explain which incident question the new field answers.
  5. Design a memory record schema for repo summaries. Include source commit, evidence files, creation time, author, freshness policy, and invalidation trigger.
  6. Write a failure scenario where stale memory produces a correct-looking but wrong answer. Then write the eval that would catch it.
  7. Compare context, trace, memory, and report as storage surfaces. Explain which one should be optimized for model performance and which one should be optimized for audit.
  8. Define a retention policy for traces that may contain source code or sensitive logs.

Checklist

  • Context is not memory.
  • Trace evidence is not prompt context.
  • Tool output is data, not authority.
  • Persistent memory requires provenance and freshness.
  • Reports should summarize state without becoming the source of truth.
  • Untrusted observations should be labeled before entering prompts.
  • State surfaces need owners and retention rules.
  • Current repository evidence should outrank stale summaries.
OpenAI Agents SDK. 2025. Tracing. https://openai.github.io/openai-agents-python/tracing/.