8 Reports and Production-Readiness Evidence

Learning Objective

After this chapter, you should be able to explain how raw traces, evals, policy warnings, and context growth become a review artifact.

Why This Matters

Raw JSONL is useful for debugging. It is not enough for a staff engineer deciding whether a system is ready for wider rollout. A report compresses evidence into a decision surface.

Core Concept

The report generator summarizes:

trace event counts,
tool-call counts,
eval pass/fail status,
policy warnings,
context growth,
large-output warnings,
deployment recommendation.

The recommendation is intentionally conservative: a large dynamic output can require human review even when evals pass.
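The first two items in the list above are plain reductions over the raw trace. The following is a minimal sketch of that counting step, not the lab's actual implementation. It assumes each JSONL line is an object with a "type" field and that tool-call events carry a "tool" field; both field names and the function name summarize_trace are hypothetical.

```python
import json
from collections import Counter


def summarize_trace(jsonl_text: str) -> dict:
    """Count events by type and tool calls by tool name.

    Assumption: every non-blank line is a JSON object with a "type" key,
    and events of type "tool_call" also have a "tool" key.
    """
    event_counts = Counter()
    tool_counts = Counter()
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the trace file
        event = json.loads(line)
        event_type = event.get("type", "unknown")
        event_counts[event_type] += 1
        if event_type == "tool_call":
            tool_counts[event.get("tool", "unknown")] += 1
    return {
        "total_events": sum(event_counts.values()),
        "events_by_type": dict(event_counts),
        "tool_calls_by_tool": dict(tool_counts),
    }


# Hypothetical four-event trace used only for illustration.
sample_trace = "\n".join([
    '{"type": "model_call"}',
    '{"type": "tool_call", "tool": "read_file"}',
    '{"type": "tool_call", "tool": "read_file"}',
    '{"type": "final_answer"}',
])
trace_summary = summarize_trace(sample_trace)
```

A summary in this shape gives the reviewer something to compare against expectations: an unexpectedly high tool-call count is a reason to open the raw trace.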

What a Report Proves

The sample report proves only what it summarizes:

a trace was available,
the run had a fixed event count,
the eval suite passed,
no policy violations were recorded,
context growth included a large dynamic output,
the deployment recommendation required review.

It does not prove that the system is production-ready. It does not prove model quality. It does not prove security. It is a structured piece of evidence that a reviewer can inspect.

That distinction is important. Evidence artifacts should reduce ambiguity, not become slogans.

Review Workflow

A staff-level review can use the report as an entry point:

Check the deployment recommendation.
Inspect failed evals or review-required warnings.
Open the trace if event counts look suspicious.
Inspect policy warnings.
Inspect context growth and large outputs.
Decide whether the gate blocks, requires approval, or allows rollout.

The report should not hide raw artifacts. It should point to them.

Report Granularity

Reports should be concise enough to read and specific enough to act on. A hundred-page dump of trace lines is not a report. A green checkmark without evidence is not a report either. The sample report chooses the middle: a small summary with enough structure to identify trace, eval, policy, and context signals.

For production, add links rather than copying everything. Link to raw traces, eval task definitions, policy versions, and artifact hashes. A report is the table of contents for evidence.
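One way to link rather than copy is to record a path and content hash for each artifact. The sketch below uses SHA-256 from the standard library. The function name artifact_reference and the returned field names are assumptions made for illustration, not part of the lab.

```python
import hashlib
import tempfile
from pathlib import Path


def artifact_reference(path: Path) -> dict:
    """Return a small, linkable record for one evidence artifact."""
    data = path.read_bytes()
    return {
        "path": path.as_posix(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "bytes": len(data),
    }


# Illustration only: write a tiny stand-in trace file and reference it.
trace_bytes = b'{"type": "final_answer"}\n'
with tempfile.TemporaryDirectory() as tmp:
    trace_path = Path(tmp) / "trace.jsonl"
    trace_path.write_bytes(trace_bytes)
    trace_ref = artifact_reference(trace_path)
```

The hash lets a reviewer confirm that the trace they open is the one the report summarized, even if the file has since been moved or regenerated.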

Case Study Step

The sample production report is deliberately not all green. The evals pass, but the report still requires human review because large dynamic context appears. That is the desired lesson. Production evidence should be capable of saying, “the task succeeded, but rollout still needs review.”

Report Status Vocabulary

A report should avoid vague labels. Useful statuses include:

passed: all required gates passed,
human_review_required: no automatic approval because a warning exceeded threshold,
blocked: a release gate failed,
not_evaluated: required evidence is missing,
invalid: the report itself could not be generated reliably.

The local report currently uses human-review language rather than a formal enum. A production version should define the status vocabulary as an explicit, machine-readable enum, and the report should also explain why each status was assigned.
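A formal version of this vocabulary could look like the sketch below. The enum values come from the list above; the class name, the combine helper, and the severity ordering are assumptions. In particular, ranking invalid and not_evaluated above blocked is a design choice, not something the lab defines.

```python
from enum import Enum


class DeploymentStatus(str, Enum):
    PASSED = "passed"
    HUMAN_REVIEW_REQUIRED = "human_review_required"
    BLOCKED = "blocked"
    NOT_EVALUATED = "not_evaluated"
    INVALID = "invalid"


# Assumed precedence, most severe first. The overall status is the most
# severe status present among the individual findings.
SEVERITY_ORDER = [
    DeploymentStatus.INVALID,
    DeploymentStatus.NOT_EVALUATED,
    DeploymentStatus.BLOCKED,
    DeploymentStatus.HUMAN_REVIEW_REQUIRED,
    DeploymentStatus.PASSED,
]


def combine(findings):
    """Reduce (status, reason) pairs to one status plus the reasons for it."""
    if not findings:
        # Missing evidence is never treated as a pass.
        return DeploymentStatus.NOT_EVALUATED, ["no evidence supplied"]
    for status in SEVERITY_ORDER:
        reasons = [reason for found, reason in findings if found is status]
        if reasons:
            return status, reasons
    raise ValueError("findings contained no recognized status")


overall, reasons = combine([
    (DeploymentStatus.PASSED, "eval suite passed"),
    (DeploymentStatus.HUMAN_REVIEW_REQUIRED, "large dynamic output exceeded threshold"),
])
```

Returning the reasons together with the status keeps the machine-readable field and its explanation from drifting apart.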

Report Consumers

Different readers need different report views. A developer wants failing checks and trace links. A tech lead wants deployment status and ownership. A security reviewer wants policy violations and risky tools. An ML engineer wants context growth, model behavior, and eval coverage. An incident reviewer wants reproducibility.

The sample report is intentionally one page. A production report could have sections or output formats for different consumers while preserving one underlying evidence model.

Staff Practice Notes

A report is a decision interface. If it does not change what a reviewer can approve, block, or investigate, it is probably just formatted logging. Start with the decision vocabulary, then decide which evidence must appear for each decision.

Reports also discipline teams against demo optimism. A system can pass task evals and still require review because context is too large, policy changed, or traces are incomplete. That is not bureaucratic caution; it is the distinction between task behavior and deployment behavior.

Operational Invariants

A report should derive from artifacts rather than reinterpreting behavior from scratch. Trace summaries, eval results, policy records, and context profiles should feed the report. If the report has its own independent logic for deciding what happened, it can drift from the evidence it claims to summarize.

A report should use stable status vocabulary. “Passed,” “human review required,” and “blocked” mean different things and should be machine-readable if they gate deployment. Free-form prose is useful for humans, but release automation needs fields with conservative semantics.

A report should distinguish runtime variance from deterministic evidence. Latency may vary locally. Timestamps may vary by run. Committed sample reports should avoid needless churn, while runtime traces can preserve local details. This distinction keeps version control useful without hiding operational facts.
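One way to keep committed reports stable is to drop run-specific fields before writing or comparing them. The sketch below is a minimal illustration; the field names in VOLATILE_FIELDS and the sample report dictionaries are hypothetical.

```python
# Assumed names for fields that legitimately differ between runs.
VOLATILE_FIELDS = {"timestamp", "latency_ms", "run_id"}


def strip_volatile(value):
    """Return a copy of nested dicts and lists without volatile keys."""
    if isinstance(value, dict):
        return {
            key: strip_volatile(item)
            for key, item in value.items()
            if key not in VOLATILE_FIELDS
        }
    if isinstance(value, list):
        return [strip_volatile(item) for item in value]
    return value


# Two hypothetical runs that differ only in runtime details.
run_a = {
    "status": "human_review_required",
    "timestamp": "2024-01-01T00:00:00Z",
    "evals": [{"task": "task_a", "passed": True, "latency_ms": 120}],
}
run_b = {
    "status": "human_review_required",
    "timestamp": "2024-01-02T09:30:00Z",
    "evals": [{"task": "task_a", "passed": True, "latency_ms": 340}],
}
stable_a = strip_volatile(run_a)
stable_b = strip_volatile(run_b)
```

The runtime trace keeps the original values. Only the committed, deterministic view omits them, so nothing operational is lost.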

The Lab

python -m agentic_systems_lab.report

Reading the Lab Output

The report command prints the path of the generated Markdown report. Open the report rather than relying on the command-line output. The report’s value is in combining signals: trace summary, eval results, policy warnings, context growth, and deployment recommendation.

The sample recommendation requires human review because large dynamic context appears. That is not a failure of the report. It is the report doing its job.

A report should be read from decision to evidence. Start with the deployment recommendation, then inspect the reasons, then follow links or paths back to raw artifacts. If the report cannot support that navigation, it is a summary without auditability.

Code Walkthrough

generate_report accepts a trace summary, eval results, a context summary, and policy violations. It creates Markdown sections and derives a deployment recommendation. This is intentionally simple so the reader can alter the gate logic.

The report generator is a reducer over evidence artifacts. It should not discover new facts by inspecting the repository again. The trace summary says what happened, the eval results say what passed, the policy violations say which boundary was exercised, and the context summary says whether prompt pressure requires review.

The deployment recommendation is conservative on purpose. A run can have passing evals and still require human review because context warnings exist. This is the main lesson of the report chapter: task correctness is not the same thing as deployment readiness.
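The gate logic described here can be sketched as a pure function over already-computed evidence. This is an illustration, not the body of generate_report. The function name, the input shapes, the task names, and the rule that any policy violation blocks deployment are all assumptions.

```python
def derive_recommendation(eval_results, policy_violations, context_warnings):
    """Return (recommendation, reasons) from evidence that already exists.

    eval_results: dict mapping task name to True/False (assumed shape).
    policy_violations, context_warnings: lists of short strings.
    """
    if not eval_results:
        return "not evaluated", ["no eval results supplied"]
    failed = sorted(name for name, passed in eval_results.items() if not passed)
    if failed or policy_violations:
        reasons = [f"eval failed: {name}" for name in failed]
        reasons += [f"policy violation: {item}" for item in policy_violations]
        return "blocked", reasons
    if context_warnings:
        # Evals passed, but the warning still prevents automatic approval.
        return "human review required", list(context_warnings)
    return "ready", ["all evals passed; no policy violations or context warnings"]


# Hypothetical evidence resembling the sample run: everything passes,
# yet one context warning is present.
recommendation, reasons = derive_recommendation(
    eval_results={"task_a": True, "task_b": True, "task_c": True},
    policy_violations=[],
    context_warnings=["large dynamic output from noisy_logs"],
)
```

Because the function only reads its arguments, changing a gate means changing one branch, and the result can be tested without running the agent.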

If you extend the report, prefer adding structured inputs before adding prose. For example, add a deployment_status enum and reason list before writing a new paragraph. Human-readable Markdown can then render the structured decision, while CI or release tooling can consume the same fields.

Expected Output

reports/sample_production_report.md should include all three eval tasks and a deployment recommendation stating that human review is required before deployment when the noisy log fixture produces a large dynamic output.

The expected recommendation is intentionally not “ready.” That matters. A report that only celebrates passing task evals would hide context risk. The sample report teaches that deployment recommendations can be conservative even when core task behavior passes.

Failure Mode

If a team only stores raw traces, reviewers have to manually reconstruct the run. If a team only stores the final answer, they cannot reconstruct it at all.

The symptom is review friction. A release reviewer asks whether the run is safe to deploy, and the answer is a directory of JSONL files, test logs, and hand-written notes. The data may be present, but the decision is still manual and inconsistent. Conversely, a single polished final answer hides the raw evidence entirely.

The root cause is missing decision modeling. Reports should not be formatted log dumps. They should map evidence to decisions: task correctness, grounding, policy status, context budget, warnings, and deployment recommendation. A report can be concise only if the underlying artifacts are structured enough to summarize.

The artifact that exposes the failure is a report that cannot explain its status. If the report says “human review required,” it should name the warning. If it says “blocked,” it should name the failed gate. If it says “ready,” it should point to evals, traces, and policy evidence. Anything less leaves release decisions dependent on reviewer intuition.
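That requirement can be checked mechanically. The sketch below lists what a report fails to explain about its own status. The status keys, the evidence field names, and the function name explanation_gaps are hypothetical.

```python
# Evidence each status must carry before a reviewer can rely on it.
# Field names are assumptions made for this sketch.
REQUIRED_EVIDENCE = {
    "human_review_required": ["warnings"],
    "blocked": ["failed_gates"],
    "ready": ["eval_results_path", "trace_path", "policy_path"],
}


def explanation_gaps(report: dict) -> list:
    """Return a list of problems; an empty list means the status is explained."""
    status = report.get("status")
    if status not in REQUIRED_EVIDENCE:
        return [f"unknown status: {status!r}"]
    return [
        f"status {status!r} is missing evidence field {name!r}"
        for name in REQUIRED_EVIDENCE[status]
        if not report.get(name)  # absent or empty both count as missing
    ]


unexplained = {"status": "human_review_required", "warnings": []}
explained = {
    "status": "human_review_required",
    "warnings": ["large dynamic output from noisy_logs"],
}
gaps_before = explanation_gaps(unexplained)
gaps_after = explanation_gaps(explained)
```

A check like this could run before the report is published, so that a status without a named warning or gate is treated as an invalid report rather than passed to a reviewer.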

Production Translation

Reports can become pull-request comments, release artifacts, incident attachments, or deployment gates. In production, the report should link to raw traces and preserve enough metadata to reproduce the run.

A mature version would include run environment, code version, model version, policy version, eval suite version, and artifact hashes. Those are omitted here to keep the lab inspectable, but they are necessary for serious release review.

For deployment, reports should have two audiences. Humans need a concise narrative: what ran, what passed, what warned, what blocks. Automation needs stable fields: status, reason codes, artifact paths, check results, and policy warning counts. Both views should come from the same data model. If the human report and CI gate can disagree, the report has become another source of production ambiguity.

Design Review Questions

For reports, ask:

Who is the primary reader?
What decision should the report support?
What raw artifacts does it summarize?
What artifacts does it link to?
What status vocabulary does it use?
What makes the report invalid?
Which warnings require human review?
Which failures block deployment?

A report should be designed around decisions. Otherwise it becomes a formatted log dump.

Review Rubric

Reject reports that are formatted final answers. A production report must summarize evidence, not merely restate the agent’s conclusion.

Require review when reports are human-readable but lack stable status fields or artifact references. A reviewer can read them, but automation and incident response will struggle.

Accept the report when it derives from trace, eval, policy, and context artifacts; uses conservative status vocabulary; and points from recommendation back to concrete evidence.

Implementation Notes

The next report improvement is dual output: Markdown for humans and JSON for automation. The Markdown report should remain readable. The JSON report should contain stable fields for CI or release tooling. Both should derive from the same internal data model so they cannot disagree.

Once a report has a machine-readable status, it can become a real gate.

Extension Path

Add a JSON report beside the Markdown report. The Markdown report should remain optimized for humans. The JSON report should expose stable fields for CI: eval statuses, policy warning count, context warning count, artifact paths, deployment status, and reason codes.

Both reports should be generated from the same internal data model. If the Markdown and JSON paths diverge, the system will eventually produce conflicting release evidence. Start by testing the shared data model, then test each rendering.
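A minimal sketch of that arrangement follows, assuming a hypothetical ReportModel dataclass whose fields mirror the CI fields listed above. The class, the renderer names, and the sample values are illustrative, not the lab's code.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ReportModel:
    deployment_status: str
    reason_codes: tuple
    eval_statuses: dict
    policy_warning_count: int
    context_warning_count: int
    artifact_paths: tuple


def render_json(model: ReportModel) -> str:
    """Machine-readable rendering with stable key order."""
    return json.dumps(asdict(model), indent=2, sort_keys=True)


def render_markdown(model: ReportModel) -> str:
    """Human-readable rendering of the same fields."""
    lines = [
        "# Production Report",
        "",
        f"Deployment status: {model.deployment_status}",
        "",
        "## Reasons",
    ]
    lines += [f"- {code}" for code in model.reason_codes]
    lines += ["", "## Evals"]
    lines += [f"- {task}: {status}" for task, status in sorted(model.eval_statuses.items())]
    return "\n".join(lines) + "\n"


# Hypothetical values for illustration only.
model = ReportModel(
    deployment_status="human_review_required",
    reason_codes=("large_dynamic_output",),
    eval_statuses={"task_a": "passed", "task_b": "passed", "task_c": "passed"},
    policy_warning_count=0,
    context_warning_count=1,
    artifact_paths=("reports/sample_production_report.md",),
)
json_report = render_json(model)
markdown_report = render_markdown(model)
```

Neither renderer computes a status of its own. Both read deployment_status from the model, which is what prevents the two outputs from disagreeing.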

Worked Scenario: Passing Evals, Review Required

The current sample report is intentionally mixed. All evals pass. No policy violations are recorded. Yet the deployment recommendation requires human review because noisy_logs creates a large dynamic context segment.

This is a realistic outcome. Many systems are correct on task behavior but still not ready for broader rollout. Maybe the next step is summarization. Maybe the output cap should be lower. Maybe log retrieval should be indexed. The report does not decide the architecture. It makes the decision unavoidable.
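The context warning in this scenario amounts to a threshold check on runtime-produced segments. The sketch below illustrates the idea. The ContextSegment type, the character-based size measure, the threshold value, and the segment sizes are all assumptions; the lab's real threshold is not stated in this chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextSegment:
    source: str    # where the text came from, for example a tool name
    chars: int     # size of the text added to the prompt
    dynamic: bool  # True when produced at runtime, such as tool output


# Hypothetical threshold chosen only for this example.
LARGE_OUTPUT_CHARS = 4_000


def large_output_warnings(segments, threshold=LARGE_OUTPUT_CHARS):
    """Return one warning per dynamic segment larger than the threshold."""
    return [
        f"large dynamic output from {seg.source}: {seg.chars} chars > {threshold}"
        for seg in segments
        if seg.dynamic and seg.chars > threshold
    ]


segments = [
    ContextSegment("system_prompt", 1_200, dynamic=False),
    ContextSegment("read_file", 900, dynamic=True),
    ContextSegment("noisy_logs", 25_000, dynamic=True),
]
context_warnings = large_output_warnings(segments)
```

Static segments are excluded on purpose: a long but fixed system prompt is a known cost, while a large runtime output is the kind of growth a reviewer needs to see.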

Chapter Synthesis

Reports exist because raw artifacts are not enough for decisions. A trace can be complete, evals can be detailed, and policy can be serialized, yet a reviewer still needs a concise statement of status and reasons. The report is that decision layer.

The important nuance is that a report should not weaken evidence into prose. It should preserve links back to trace, eval, policy, and context artifacts. Good reports make decisions faster without making them less auditable.

Evidence and References

Claims about report behavior rest on local evidence: reports/sample_production_report.md and tests/test_report.py. The risk-gating language is aligned with the measurement and management posture in NIST AI RMF (National Institute of Standards and Technology 2023).

Takeaways

Reports turn runtime artifacts into release decisions.
A passing eval suite can still produce a human-review deployment status.
Human-readable and machine-readable reports should share one data model.

Exercises

Add a cost-cap warning. Define the estimated-cost field, threshold, warning message, and test fixture.
Add a human-approval-required section. Make it appear only when a concrete warning category is present.
Add a release-blocking status when evals fail. Verify that the status changes even if the trace and policy sections are present.
Convert the Markdown report into JSON. Keep both renderings backed by the same internal data model.
Add a stable-report regeneration test that fails if committed sample reports drift from deterministic generator output.
Design a report diff for release review. State which fields should be compared across builds and which runtime fields should be ignored.
Add a policy-warning table grouped by tool, violation type, and affected run.
Write a one-page production review using only generated artifacts from this repo.

Checklist

Reports are decision artifacts.
A passing eval can still require review.
Link summaries back to raw evidence.
Human-readable and machine-readable reports should agree.
Runtime latency should not churn committed deterministic artifacts.
Deployment status should be conservative when evidence is missing.
Report sections should map to owner actions.
A report should make the release decision reviewable.

National Institute of Standards and Technology. 2023. Artificial Intelligence Risk Management Framework (AI RMF 1.0). https://doi.org/10.6028/NIST.AI.100-1.