6 Tracing: From Transcript to Runtime Evidence

Learning Objective

After this chapter, you should be able to explain why a transcript is weaker than a trace and identify the minimum runtime events needed for an agent run.

Why This Matters

A final answer does not show the path. It does not show which files were read, which tool call failed, whether policy was checked, or how much context was accumulated. OpenAI’s Agents SDK tracing guide describes traces as end-to-end operations composed of spans and notes that tool calls and guardrails are traced by default in that SDK (OpenAI Agents SDK 2025). OpenTelemetry’s GenAI semantic conventions provide another source for naming and structuring generative-AI telemetry (OpenTelemetry 2025).

Core Concept

This repo uses JSONL events rather than a full tracing backend. The point is to make the evidence visible:

agent_start
policy_check
tool_call
context_observation
eval_check
agent_finish

Each event has a run_id and timestamp. Tool calls also record latency and success.
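
A minimal sketch of what writing such events could look like follows. It is not the repo's implementation: the helper names make_event and append_event, the event_type key, and any extra fields are assumptions made for illustration. Only the six event names, run_id, and timestamp come from the description above.

```python
import json
import time
from pathlib import Path

# Event names are taken from the list above.
EVENT_TYPES = {
    "agent_start", "policy_check", "tool_call",
    "context_observation", "eval_check", "agent_finish",
}

def make_event(run_id, event_type, **fields):
    """Build one event dict; the 'event_type' key name is an assumption."""
    if event_type not in EVENT_TYPES:
        raise ValueError(f"unknown event type: {event_type}")
    return {"run_id": run_id, "timestamp": time.time(),
            "event_type": event_type, **fields}

def append_event(path, event):
    """Append the event as a single JSON object on its own line (JSONL)."""
    with Path(path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(event) + "\n")
```

Appending one line per event keeps the file readable with ordinary text tools and lets a reader process it incrementally.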

Transcript vs Trace

A transcript is optimized for human conversation. A trace is optimized for reconstruction. If the final answer says “the bug is in calculator.py,” the trace should show whether calculator.py was read, whether the read succeeded, how much content was observed, and whether any policy warnings occurred.

This difference matters when multiple explanations are plausible. Suppose the agent misses a bug. Did it fail because the file was absent? Did list_files hide it? Did the file read fail? Did context caps truncate the relevant line? Did the diagnosis rule fail after observing the right evidence? A transcript usually cannot answer those questions. A trace can if the runtime records the right events.

The local trace is intentionally not a distributed tracing system. It has no service graph, sampling policy, baggage, or backend exporter. But it still teaches the central habit: record the path, not just the conclusion.

Trace Review Procedure

For each agent run, inspect:

agent_start: Were the task and target repository correct?
policy_check: What was the runtime allowed to do?
tool_call: Which tools ran, with which arguments, and did they succeed?
context_observation: Which outputs entered the context budget?
agent_finish: Did the runtime finish normally?

Then compare the trace to the eval result. A passing eval with a bad trace can still indicate risk. A failing eval with a good trace can identify a diagnosis or output-schema problem.

Trace Completeness

Trace completeness is contextual. A local teaching repo can use a handful of events. A production system may need span IDs, parent IDs, model identifiers, prompt hashes, redaction status, token counts, retry attempts, and deployment version. The common principle is that the trace should answer the questions a reviewer will ask after a failure.

If the trace cannot answer those questions, add fields before the next incident. Do not wait for the final observability architecture; improve the evidence contract incrementally.

Case Study Step

The buggy_calc trace proves that the agent inspected calculator.py and test_calculator.py through policy-controlled tools. The trace does not prove the diagnosis is correct; the eval handles a different slice of evidence. This separation is useful. The trace answers “what happened?” The eval answers “did the observed behavior satisfy the check?”

Trace Anti-Patterns

Avoid traces that are merely transcripts in JSON. A useful trace should not only record model messages. It should record tool calls, policy checks, context size, failures, retries, and output validation. Otherwise the trace recreates the weakness of the transcript.

Also avoid traces that are too noisy to inspect. If every internal helper call becomes an event, the signal can disappear. The event set should be chosen around review questions. For this repo, the review questions are: what did the agent start with, what policy applied, what tools ran, what context was observed, and how did the run finish?

As systems mature, trace granularity can increase. The first requirement is not completeness in the abstract. It is enough evidence for the failures you expect to debug.

Redaction and Sampling

Production traces often need redaction and sampling. Redaction protects sensitive content. Sampling controls cost and volume. Both can weaken evidence if applied carelessly. If a trace redacts all tool arguments, it may no longer prove which file was read. If sampling drops failed runs, incident review becomes distorted.

The local repo avoids that complexity by running small examples locally. A production design should define which fields are always retained, which fields are redacted, and which runs are never sampled away. Failed or policy-violating runs usually deserve stronger retention than routine successes.
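
One way such a production rule might be expressed is sketched below. The local repo does not implement this. The field names (args, output, success, allowed), the hashing scheme, and both function names are hypothetical. Hashing an argument instead of deleting it lets a reviewer confirm that a suspected path matches the recorded one without storing the path itself.

```python
import hashlib
import json
import random

# Hypothetical policy: these keys are replaced by a short digest.
REDACTED_FIELDS = {"args", "output"}

def redact_event(event):
    """Return a copy of the event with sensitive values replaced by hashes."""
    cleaned = {}
    for key, value in event.items():
        if key in REDACTED_FIELDS:
            canonical = json.dumps(value, sort_keys=True, default=str)
            digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
            cleaned[key] = f"sha256:{digest}"
        else:
            cleaned[key] = value
    return cleaned

def keep_run(events, sample_rate, rng=random.random):
    """Never sample away a run with a failed call or a denied policy check."""
    for event in events:
        if event.get("success") is False or event.get("allowed") is False:
            return True
    return rng() < sample_rate
```

A digest of a guessable value is not strong protection by itself, so a rule like this complements access control and does not replace it.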

Staff Practice Notes

Tracing should be designed from the questions you will ask during an incident. “Why did it answer that?”, “What did it read?”, “Which policy was active?”, “Did validation fail?”, and “Was context dominated by logs?” are better schema drivers than generic completeness. If the trace cannot answer those questions, more logging volume will not fix it.

At the same time, avoid trace maximalism. Raw prompts, full tool outputs, and unbounded logs can create security and retention problems. A mature trace is selective: enough structure to reconstruct decisions, enough redaction to be safe, and enough artifact references to inspect raw evidence when authorized.

Operational Invariants

Every trace event should be attributable to exactly one run. The run ID is the join key across trace, eval, report, and incident notes. If artifacts cannot be joined, the system cannot reconstruct behavior reliably after the fact.
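
The attribution rule can be checked mechanically. The sketch below groups events by run_id and sets aside any event that cannot be attributed. The function name group_by_run is hypothetical, and the sketch assumes events have already been parsed into dicts.

```python
from collections import defaultdict

def group_by_run(events):
    """Split events into per-run lists plus a list of unattributable events."""
    runs = defaultdict(list)
    orphans = []
    for event in events:
        run_id = event.get("run_id")
        if run_id:
            runs[run_id].append(event)
        else:
            orphans.append(event)  # cannot be joined to any report or eval
    return dict(runs), orphans
```

A non-empty orphan list is itself a finding: those events cannot be joined to an eval result or an incident note.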

Every risky boundary should emit evidence. Policy activation, tool calls, context growth, output validation, and finish status are the minimum local boundaries in this repo. A production system may add model calls, retries, approvals, redaction state, and artifact hashes. The principle is the same: the trace should show where authority, data, or control flow changed.

Every trace schema should be reviewable by humans and consumable by tools. JSONL is plain enough for local inspection and structured enough for automated summaries. If the trace becomes too verbose for humans and too irregular for machines, it is no longer serving either audience well.

The Lab

python -m agentic_systems_lab.tracer
python -m agentic_systems_lab.agent

Reading the Lab Output

The tracer demo prints event counts. The agent run creates a richer trace. The most important habit is to inspect sequence, not only counts. A normal triage run should start, record policy, call tools, record context observations, and finish. If the sequence is broken, the final answer should be treated cautiously.

The trace summary is a compression. When debugging a real failure, open the JSONL trace itself.

When inspecting the JSONL, read it as a timeline. Confirm that policy appears before tool calls, tool calls include success state, context observations follow evidence gathering, and finish status appears exactly once. Event counts are useful, but ordering is what turns a log into process evidence.
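
The timeline checks above can be sketched as a small function. This is an illustration under stated assumptions, not the lab's own checker: it assumes each event is a dict with an event_type key and that tool_call events carry a success field. The function name check_sequence is hypothetical.

```python
def check_sequence(events):
    """Return ordering problems for one run; an empty list means none found."""
    problems = []
    types = [event.get("event_type") for event in events]

    if not types or types[0] != "agent_start":
        problems.append("run does not begin with agent_start")
    if types.count("agent_finish") != 1:
        problems.append("agent_finish must appear exactly once")

    # Policy should be recorded before the first tool runs.
    if "tool_call" in types:
        first_tool = types.index("tool_call")
        if "policy_check" not in types[:first_tool]:
            problems.append("tool_call before any policy_check")

    # Context observations should follow evidence gathering.
    if "context_observation" in types:
        first_obs = types.index("context_observation")
        if "tool_call" not in types[:first_obs]:
            problems.append("context_observation before any tool_call")

    for event in events:
        if event.get("event_type") == "tool_call" and "success" not in event:
            problems.append("tool_call without success field")
    return problems
```

Returning a list of problems, not a single boolean, keeps the result useful as review evidence: each entry names the specific ordering rule that was broken.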

Code Walkthrough

TraceLogger.log_event validates the event type and appends one JSON object to the trace file. TraceLogger.timed_tool_call wraps a function, records whether it succeeded, and writes its latency in milliseconds.

The logger is deliberately small. It does not try to be a full observability platform. It enforces the local schema, writes one JSON object per line, and keeps event construction close to the runtime. This makes traces easy to inspect in tests and easy to explain in the manuscript.

timed_tool_call is the most production-shaped helper because it captures both successful and failed operations. Success paths are not enough for incident review. A failed tool call needs tool name, arguments, latency, success state, and error information. Otherwise the trace shows only what worked and hides what the runtime struggled with.
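
To make the failure path concrete, here is a standalone sketch of a timing wrapper in the same spirit. It is not the repo's TraceLogger.timed_tool_call, whose exact signature is not shown in this chapter. The function name, the log_event callback parameter, the field names, and the choice to re-raise the original exception are all assumptions.

```python
import time

def timed_call_sketch(log_event, run_id, tool_name, func, **kwargs):
    """Call func(**kwargs) and log one tool_call event on success or failure."""
    record = {
        "run_id": run_id,
        "timestamp": time.time(),
        "event_type": "tool_call",
        "tool": tool_name,
        "args": kwargs,
    }
    started = time.perf_counter()
    try:
        result = func(**kwargs)
    except Exception as exc:
        # Record the failure, then let the caller decide how to handle it.
        record["success"] = False
        record["error"] = f"{type(exc).__name__}: {exc}"
        raise
    else:
        record["success"] = True
        return result
    finally:
        # Runs on both paths, so failed calls also get a latency value.
        record["latency_ms"] = round((time.perf_counter() - started) * 1000, 3)
        log_event(record)
```

Because the event is written in the finally clause, a failed read leaves the same five kinds of evidence as a successful one, plus the error text.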

The summarizer reads the JSONL file back and counts events. That round trip matters. It proves the trace is not only writeable but also usable as an artifact. Reports and future gates should consume trace data through structured readers rather than by scraping rendered prose.
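
A structured reader of this kind might look like the following sketch. The function names read_events and summarize, the event_type key, and the shape of the returned summary are assumptions; the lab's summarizer may differ.

```python
import json
from collections import Counter
from pathlib import Path

def read_events(path):
    """Parse a JSONL trace file into a list of dicts, skipping blank lines."""
    events = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue
            try:
                events.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc
    return events

def summarize(events):
    """Count events by type and count failed tool calls separately."""
    counts = Counter(event.get("event_type", "unknown") for event in events)
    failed = sum(
        1 for event in events
        if event.get("event_type") == "tool_call" and event.get("success") is False
    )
    return {"event_counts": dict(counts), "failed_tool_calls": failed}
```

Reporting the line number of a malformed record makes a corrupted trace a visible failure, not a silently shorter summary.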

Expected Output

python -m agentic_systems_lab.tracer writes traces/demo_trace.jsonl and prints event counts. Running the agent writes a richer trace with policy checks, tool calls, context observations, and finish status.

The expected output proves both write path and read path. A trace that can be written but not summarized is not yet a useful artifact. A summary with counts but no underlying JSONL is not enough for incident review. The lab keeps both visible.

Failure Mode

Without traces, failures become anecdotal. A team cannot distinguish “the model reasoned badly” from “the wrong file was read,” “a policy check failed,” or “context was dominated by log noise.”

The symptom is an incident review that starts from memory. One engineer remembers the model answer. Another remembers the tool output. A third remembers a policy change. None of those memories are durable enough to reconstruct the run. The team can debate hypotheses, but it cannot replay the evidence path.

The root cause is treating observability as log collection rather than runtime design. A useful trace is not just “print more things.” It records typed events with run IDs, policy, tool calls, success and failure fields, latency, context observations, and finish status. It lets reviewers separate evidence selection from model behavior and model behavior from deployment gating.

The artifact that exposes the failure is a trace completeness check. A run with a final answer but no tool_call events cannot prove evidence inspection. A run with tool calls but no policy_check cannot prove authority. A run with no agent_finish event cannot prove whether output validation succeeded. The trace schema should make those gaps obvious.
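
A completeness check along these lines can be very small. The sketch below assumes an event_type key on each event; the function name completeness_gaps and the message wording are illustrative only.

```python
# Each required event type is paired with the claim the run cannot
# support when that event is missing (wording follows the text above).
REQUIRED_EVENTS = {
    "tool_call": "cannot prove evidence inspection",
    "policy_check": "cannot prove authority",
    "agent_finish": "cannot prove whether output validation succeeded",
}

def completeness_gaps(events):
    """List the required event types that never appear in this run."""
    seen = {event.get("event_type") for event in events}
    return [
        f"missing {name}: {consequence}"
        for name, consequence in REQUIRED_EVENTS.items()
        if name not in seen
    ]
```

An empty result means only that the required event types are present. It says nothing about whether they occurred in a sensible order or whether the diagnosis was correct.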

Production Translation

Production systems need telemetry that can be inspected during development and incident response. This lab does not claim its JSONL format is enough for every deployment; it is the minimal local artifact that teaches the contract.

In production, trace retention and privacy become first-class concerns. Tool arguments may contain internal paths, customer identifiers, or secrets. Tool outputs may contain proprietary code or user data. A trace strategy therefore needs redaction, retention windows, access controls, and incident procedures. The local repo does not implement those controls; it makes the evidence surface explicit so those questions can be asked.

Operationally, traces serve several consumers at once. They support debugging, eval triage, release review, abuse investigation, and cost analysis, and those consumers want overlapping but different fields. If the trace is optimized only for local debugging, governance and release gates will ask for missing data later. If the trace logs everything for every consumer, it becomes expensive and risky. The right production trace is a deliberately scoped evidence product.

Design Review Questions

For tracing, ask:

What event proves the run started?
What event records active policy?
What event records tool calls and arguments?
What event records failures?
What event records context growth?
What event records output validation?
Which fields are redacted?
Which failed runs are retained?
How does a reviewer find the raw trace from a report?

A trace schema should be designed from incident questions backward.

Review Rubric

Reject traces that store only final answers or raw transcripts. They cannot prove tool use, policy, context growth, or validation.

Require review when traces are rich but unredacted, unbounded, or disconnected from reports. More data can create security and retention risk without improving decisions.

Accept the trace contract when event coverage answers incident questions, run IDs join artifacts, failure paths are recorded, and summaries can be generated without scraping prose.

Implementation Notes

The next trace improvement would be explicit policy_violation and warning events. Today, policy violations live on the policy object and can enter reports. A richer trace would emit them as first-class events. That would make incident reconstruction easier because the trace itself would contain both attempted action and policy outcome.

Another useful improvement is artifact linking: report path, eval report path, and trace path should reference each other.

Extension Path

Add first-class warning and policy_violation events. Start with tests that write those events, summarize them, and render them in the production report. Then connect suspicious prompt-injection detection and policy-denied actions to the new event types.

The important design choice is to keep warnings nonfatal unless a gate says otherwise. A trace warning should improve review evidence. A deployment gate can later decide whether that warning blocks, requires human review, or remains informational.
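
The separation between recording a warning and acting on it can be sketched as a gate that takes its rules as input. Everything here is hypothetical: the warning and policy_violation events do not exist in the repo yet, and the function name, rule format, and outcome labels are invented for illustration.

```python
# Outcomes ordered from least to most severe.
SEVERITY = {"informational": 0, "review": 1, "block": 2}

def gate_decision(events, rules):
    """Return the most severe outcome that the rules assign to any event.

    rules maps an event type to "informational", "review", or "block".
    Event types without a rule are treated as informational.
    """
    decision = "informational"
    for event in events:
        outcome = rules.get(event.get("event_type"), "informational")
        if SEVERITY[outcome] > SEVERITY[decision]:
            decision = outcome
    return decision
```

The trace writer never needs to know these rules. The same recorded warning can be informational in local development and require review in a release gate simply by passing a different rules mapping.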

Worked Scenario: The Misleading Final Answer

Suppose the agent returns: “The bug is in calculator.py.” Without a trace, you cannot tell whether it read the file, guessed from the filename, or copied an earlier memory. With a trace, you can inspect the path. Did list_files include calculator.py? Did read_file succeed? Was test_calculator.py read? Did context observations include both files?

If the answer is right but the trace shows the wrong path, the run is not trustworthy. It may have succeeded accidentally. That is why traces are not just debugging tools. They are evidence of process quality.

Chapter Synthesis

Tracing turns an agent run from an anecdote into an artifact. The final answer tells you what the system said. The trace tells you what the system did. For tool-using systems, that distinction is central to debugging and deployment review.

The chapter also narrows ambition deliberately. JSONL is not a universal observability solution, but it makes the evidence contract visible. Once the contract is clear, richer tracing systems can implement the same semantics with spans, attributes, retention, and redaction.

Evidence and References

Tracing concepts are grounded in OpenAI Agents SDK tracing and OpenTelemetry GenAI conventions (OpenAI Agents SDK 2025; OpenTelemetry 2025). Local behavior is verified by tests/test_tracer.py.

Takeaways

A trace records what the system did, not only what it said.
Useful traces answer incident questions with structured events.
Redaction, retention, and artifact links are part of trace design.

Exercises

Add a warning event type. Write the test first and verify that summaries count warnings separately from failures.
Count tool calls per run. Distinguish attempted, successful, failed, and policy-blocked tool calls.
Write a bad trace and identify the missing evidence. Include at least one missing run ID, one missing success field, and one missing output-validation event.
Map one JSONL event to an OpenTelemetry span concept. State which fields would become span name, attributes, status, and duration.
Add a trace summary check that fails when a run has an agent output but no tool observations.
Design a redaction rule for trace metadata that may contain secrets or proprietary source snippets.
Write an incident-review query: given a run ID, what exact sequence of artifacts should an on-call engineer inspect?
Compare tracing requirements for local development, CI, shadow mode, and production traffic.

Checklist

A transcript is not enough.
Tool calls need success and latency fields.
Trace data may contain sensitive information and needs retention policy.
Every event should be attributable to a run.
Successful final answers still need process evidence.
Summaries should preserve enough detail for release gates.
Redaction policy should be designed before production traces exist.
Traces should support both debugging and governance questions.

OpenAI Agents SDK. 2025. Tracing. https://openai.github.io/openai-agents-python/tracing/.

OpenTelemetry. 2025. Semantic Conventions for Generative AI Systems. https://opentelemetry.io/docs/specs/semconv/gen-ai/.