16  Capstone: From Toy Agent to AgentProbe

Learning Objective

After this chapter, you should be able to turn the lab into a focused OSS project with a falsifiable first release.

Why This Matters

A common failure mode after learning about agents is to build a broad demo. The stronger move is to extract one useful system boundary and make it reusable.

Core Concept

The book suggests four directions:

Project        Focus
agentprobe     Trace, eval, and production-readiness harness for agent runs
cachepilot     Prompt-cache breakage analyzer
agentfit-mlx   Local Apple Silicon agent workload profiler
toolguard      Tool policy and approval layer

The strongest continuation of this book is probably agentprobe, because it follows the core arc:

build -> trace -> eval -> report -> deployment gate

This is author judgment, not a market claim.

First Release Shape

A focused agentprobe v0 could do three things:

  1. ingest a trace file,
  2. run deterministic checks,
  3. emit a production-readiness report.

That is enough to be useful. It does not need to be an agent framework, a tracing backend, an eval platform, or a dashboard. The first release should preserve the book’s thesis: make agent runs observable, evaluable, bounded, and reviewable.
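
The three steps can be sketched in a few dozen lines. This is a minimal illustration, not the lab's implementation: the function names, the `type` field, and the set of required event types are all assumptions made for the example.

```python
import json

# Hypothetical minimum: event types a run must contain to be reviewable.
REQUIRED_EVENT_TYPES = {"run_start", "tool_call", "run_end"}


def ingest(jsonl_text):
    """Step 1: parse one JSON object per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def run_checks(events):
    """Step 2: deterministic check for required event types that never appear."""
    seen = {event.get("type") for event in events}
    return {"missing_event_types": sorted(REQUIRED_EVENT_TYPES - seen)}


def emit_report(events, checks):
    """Step 3: build a plain dict so the report can be serialized and diffed."""
    return {
        "event_count": len(events),
        "checks": checks,
        "passed": not checks["missing_event_types"],
    }


# A two-event trace that never records the end of the run.
sample_trace = "\n".join([
    '{"type": "run_start", "run_id": "r1"}',
    '{"type": "tool_call", "run_id": "r1", "tool": "read_file"}',
])
events = ingest(sample_trace)
report = emit_report(events, run_checks(events))
print(json.dumps(report, sort_keys=True))
```

The sample trace fails because it has no `run_end` event, which is the kind of missing evidence a v0 report should name explicitly.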

Seven Starter Issues

  1. Define a trace event schema.
  2. Add a JSONL trace loader.
  3. Add deterministic schema/evidence checks.
  4. Add policy violation summaries.
  5. Add context-growth summaries.
  6. Generate Markdown and JSON reports.
  7. Provide fixture traces and failing examples.

Each issue should include a test and a sample artifact.
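
As an illustration of what "a test and a sample artifact" might look like, here is a sketch for issue 4. The `policy_decision` field and its values are assumptions for the example, not the lab's schema, and the summary counts every decision other than `allow`.

```python
from collections import Counter

# Sample artifact: a tiny fixture trace. Field names and values are assumptions.
FIXTURE_EVENTS = [
    {"type": "tool_call", "tool": "read_file", "policy_decision": "allow"},
    {"type": "tool_call", "tool": "shell", "policy_decision": "deny"},
    {"type": "tool_call", "tool": "shell", "policy_decision": "deny"},
    {"type": "tool_call", "tool": "write_file", "policy_decision": "needs_approval"},
]


def summarize_policy_violations(events):
    """Count tool calls whose policy decision was anything other than allow."""
    counts = Counter(
        (event["tool"], event["policy_decision"])
        for event in events
        if event.get("type") == "tool_call"
        and event.get("policy_decision", "allow") != "allow"
    )
    # Sort the keys so the summary is stable across runs.
    return [
        {"tool": tool, "decision": decision, "count": count}
        for (tool, decision), count in sorted(counts.items())
    ]


def test_policy_summary_matches_fixture():
    assert summarize_policy_violations(FIXTURE_EVENTS) == [
        {"tool": "shell", "decision": "deny", "count": 2},
        {"tool": "write_file", "decision": "needs_approval", "count": 1},
    ]


test_policy_summary_matches_fixture()
```

The fixture doubles as documentation: a reviewer can read the four events and predict the summary without running the tool.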

What Not to Build First

Do not start with a dashboard. Do not start with a multi-agent orchestrator. Do not start with provider abstraction. Those may become useful later, but they distract from the first useful contract: given a run artifact, can the tool explain what happened and whether the run passed local gates?

The strongest v0 is boring and inspectable. It should run locally, accept fixture traces, produce deterministic reports, and make failure cases easy to add.

Case Study Step

The case study already contains seed artifacts for agentprobe: a trace, eval results, policy configuration, context summary, and production report. The capstone is not speculative. It is the act of extracting the book’s evidence pattern into a focused tool.

A Minimal Interface

An agentprobe command-line interface might begin like this:

agentprobe inspect traces/run.jsonl --eval evals/buggy_calc.json --policy policy.yaml
agentprobe report traces/run.jsonl --out report.md

The first command would validate the run artifact. The second would produce a report. That is enough for a useful first release if the fixtures are good and the failure messages are precise.
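
One possible way to parse those two commands uses the standard-library `argparse` module. The subcommand and flag names come from the commands above; the `dest` names and help strings are assumptions for this sketch.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="agentprobe")
    subcommands = parser.add_subparsers(dest="command", required=True)

    # agentprobe inspect <trace> --eval <spec> --policy <config>
    inspect_cmd = subcommands.add_parser("inspect", help="validate a run artifact")
    inspect_cmd.add_argument("trace", help="path to a JSONL trace")
    inspect_cmd.add_argument("--eval", dest="eval_path", help="eval specification")
    inspect_cmd.add_argument("--policy", dest="policy_path", help="policy configuration")

    # agentprobe report <trace> --out <path>
    report_cmd = subcommands.add_parser("report", help="render a report")
    report_cmd.add_argument("trace", help="path to a JSONL trace")
    report_cmd.add_argument("--out", required=True, help="where to write the report")
    return parser


args = build_parser().parse_args(
    ["inspect", "traces/run.jsonl", "--eval", "evals/buggy_calc.json", "--policy", "policy.yaml"]
)
print(args.command, args.trace, args.eval_path, args.policy_path)
```

Keeping the parser in its own function makes the CLI contract testable without spawning a process.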

Measurable Claim

The first release should make a narrow claim:

Given a JSONL trace and a deterministic eval specification, agentprobe produces a reproducible report that identifies missing evidence, failed evals, policy violations, and context warnings.

That claim is testable. It avoids broad statements about agent reliability. It also aligns with the evidence standard of this book.
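
"Reproducible" can itself be checked mechanically. The sketch below assumes the report is a JSON-serializable dict with the four finding categories from the claim; the key names are hypothetical. It compares digests of two renderings of the same findings.

```python
import hashlib
import json


def render_report(findings):
    """Serialize with sorted keys and fixed separators so the bytes are stable."""
    return json.dumps(findings, sort_keys=True, indent=2, separators=(",", ": ")) + "\n"


def report_digest(findings):
    """SHA-256 of the rendered report, usable as a reproducibility check."""
    return hashlib.sha256(render_report(findings).encode("utf-8")).hexdigest()


# Same findings built in a different key order, as might happen across runs.
first = {
    "missing_evidence": ["test_output"],
    "failed_evals": [],
    "policy_violations": ["shell: deny"],
    "context_warnings": [],
}
second = {
    "context_warnings": [],
    "policy_violations": ["shell: deny"],
    "failed_evals": [],
    "missing_evidence": ["test_output"],
}

assert report_digest(first) == report_digest(second)
```

Anything that varies between runs, such as wall-clock timestamps or unordered collections, would break this property and should either stay out of the report or be normalized before rendering.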

Capstone Evaluation

Evaluate the capstone the way the book evaluates the agent:

  • Does it parse valid traces?
  • Does it reject malformed traces?
  • Does it identify missing required events?
  • Does it summarize policy warnings?
  • Does it preserve deterministic output?
  • Does it produce a report that can be reviewed by a human?

The first user of agentprobe should be this repository. If it cannot improve this lab’s evidence story, it is not ready to generalize.
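
The first two questions can be turned into tests directly. The loader below is a sketch under assumptions: `TraceError` and `load_trace` are hypothetical names, and the only rule enforced is that every non-empty line is a JSON object.

```python
import json


class TraceError(ValueError):
    """Raised when a trace line cannot be accepted."""


def load_trace(jsonl_text):
    """Parse JSONL text, failing with the offending line number."""
    events = []
    for line_number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between events
        try:
            event = json.loads(line)
        except json.JSONDecodeError as exc:
            raise TraceError(f"line {line_number}: invalid JSON ({exc.msg})") from exc
        if not isinstance(event, dict):
            raise TraceError(f"line {line_number}: expected a JSON object")
        events.append(event)
    return events


def test_parses_valid_trace():
    assert len(load_trace('{"type": "run_start"}\n{"type": "run_end"}\n')) == 2


def test_rejects_malformed_trace():
    try:
        load_trace('{"type": "run_start"}\n{"type": ')
    except TraceError as error:
        assert str(error).startswith("line 2:")
    else:
        raise AssertionError("malformed trace was accepted")


test_parses_valid_trace()
test_rejects_malformed_trace()
```

Reporting the line number is what makes the failure message precise enough to act on.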

Staff Practice Notes

The capstone is tempting to overbuild because the domain is broad. Resist that. A narrow evidence tool can be more valuable than a broad agent framework because it fits into existing systems. The first release should make one run more reviewable.

Think like an external adopter. They already have orchestration, prompts, and tools. They may not want yours. But they do need to know whether a run had enough evidence, whether evals passed, whether policy was visible, and whether context risk exists. That is the opening.

Operational Invariants

The first public contract should be narrow enough to test completely. Trace input, schema validation, eval summary, policy summary, context warning, and report output are enough for a first release. Orchestration, UI, hosted services, and model adapters can wait.

The capstone should accept external artifacts without forcing framework migration. A user should be able to bring one trace from another agent framework, map required fields, and receive a useful evidence report. That is the portable value of the book.

The project should preserve fixture-driven development. Every new feature should have a small trace, eval, or policy fixture that demonstrates the behavior. Without fixtures, agentprobe would become another impressive but hard-to-review demo.

The Lab

python scripts/run_all_examples.py

Reading the Lab Output

The example runner prints all core artifacts in one JSON object: workflow evidence, agent result, trace summary, evals, and context summary. That output is essentially the seed data model for agentprobe. The capstone question is: how would you package this evidence pattern so another project can use it?

The answer should start with trace and report contracts, not a dashboard.

The combined output also reveals the extraction order. Trace and eval readers come first because they define evidence. Report generation comes next because it turns evidence into a review artifact. UI, integrations, and orchestration should wait until the evidence contract is stable.

Code Walkthrough

The capstone seed already exists: trace logger, eval runner, context profiler, policy object, and report generator. A real agentprobe would generalize those APIs beyond toy repos.

The extraction path should start with readers, not writers. Load a trace. Validate required fields. Summarize events. Load eval results. Render a report. These operations make existing agent systems more reviewable without asking users to change orchestration frameworks.
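
The last of those reader operations, rendering, might look like the sketch below. The report keys (`run_id`, `passed`, and the four finding lists) are assumptions for illustration rather than the lab's actual report shape.

```python
# Fixed section order keeps the rendered report deterministic.
SECTIONS = (
    ("Missing evidence", "missing_evidence"),
    ("Failed evals", "failed_evals"),
    ("Policy violations", "policy_violations"),
    ("Context warnings", "context_warnings"),
)


def render_markdown(report):
    """Render a findings dict as a small Markdown review artifact."""
    lines = [f"# Run report: {report['run_id']}", ""]
    lines.append(f"Verdict: {'PASS' if report['passed'] else 'FAIL'}")
    for title, key in SECTIONS:
        lines += ["", f"## {title}", ""]
        items = sorted(report.get(key, []))
        lines += [f"- {item}" for item in items] or ["- none"]
    return "\n".join(lines) + "\n"


markdown = render_markdown({
    "run_id": "r1",
    "passed": False,
    "missing_evidence": ["run_tests"],
    "failed_evals": [],
    "policy_violations": [],
    "context_warnings": [],
})
print(markdown)
```

Empty sections are printed as "none" rather than omitted, so a reviewer can tell that a check ran and found nothing.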

The second boundary is schema normalization. Different frameworks will use different trace shapes, but an evidence tool can define the minimum fields needed for review: run ID, event type, timestamp, tool call, success, latency, policy or capability context, and output validation. Unsupported fields should be reported explicitly rather than guessed.
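
A normalization step along these lines could be sketched as follows. The snake_case field names are one possible spelling of the minimum fields listed above, and the external key names in `EXTERNAL_FIELD_MAP` are invented for the example; no real framework's trace shape is implied.

```python
# Minimum review fields from the text, as hypothetical snake_case keys.
REQUIRED_FIELDS = (
    "run_id", "event_type", "timestamp", "tool_call",
    "success", "latency_ms", "policy_context", "output_valid",
)

# Hypothetical mapping from an external trace's keys to the fields above.
EXTERNAL_FIELD_MAP = {
    "trace_id": "run_id",
    "kind": "event_type",
    "ts": "timestamp",
    "tool": "tool_call",
    "ok": "success",
    "duration_ms": "latency_ms",
}


def normalize_event(raw, field_map=EXTERNAL_FIELD_MAP):
    """Map known keys and report everything else instead of guessing."""
    normalized = {}
    unsupported = []
    for key, value in raw.items():
        if key in field_map:
            normalized[field_map[key]] = value
        elif key in REQUIRED_FIELDS:
            normalized[key] = value
        else:
            unsupported.append(key)
    missing = [name for name in REQUIRED_FIELDS if name not in normalized]
    return {"event": normalized, "missing": missing, "unsupported": sorted(unsupported)}


result = normalize_event({
    "trace_id": "r1", "kind": "tool_call", "ts": "2025-01-01T00:00:00Z",
    "tool": "read_file", "ok": True, "duration_ms": 12, "vendor_span": "abc",
})
print(result["missing"])      # required fields the source trace did not provide
print(result["unsupported"])  # keys the adapter declined to interpret
```

Returning `missing` and `unsupported` alongside the event keeps the adapter honest: the report can state exactly which evidence the external trace could not supply.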

The third boundary is fixture portability. The toy repos are small, but the pattern transfers: each fixture should encode a failure mode with expected evidence and pass/fail checks. agentprobe should ship with fixtures that prove its own claims before it asks users to trust it on their systems.
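
A fixture in this sense can be a small data structure plus a runner. The failure mode, tool names, and rule below are hypothetical and chosen only to show the shape.

```python
# A hypothetical fixture: one failure mode, its trace, and the expected verdict.
FIXTURE = {
    "name": "edit_without_test_run",
    "failure_mode": "agent edits a file but never runs the tests",
    "events": [
        {"type": "tool_call", "tool": "edit_file"},
        {"type": "final_answer"},
    ],
    "expected": {"passed": False, "missing_evidence": ["run_tests"]},
}

# Assumed rule for this sketch: each of these tools must be called at least once.
REQUIRED_TOOLS = ("run_tests",)


def check_evidence(events):
    """Report which required tools never appear as tool calls."""
    called = {e.get("tool") for e in events if e.get("type") == "tool_call"}
    missing = [tool for tool in REQUIRED_TOOLS if tool not in called]
    return {"passed": not missing, "missing_evidence": missing}


def run_fixture(fixture):
    """Return True when the checker's verdict matches the fixture's expectation."""
    return check_evidence(fixture["events"]) == fixture["expected"]


assert run_fixture(FIXTURE), FIXTURE["name"]
```

Because the expected verdict lives in the fixture, adding a new failure case means adding data, not code.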

Expected Output

The command regenerates trace, eval, context, and production report artifacts that could become agentprobe fixtures.

The expected output shows why agentprobe should start as evidence tooling. The reusable object is the bundle of artifacts, not the toy diagnosis itself. If a future capstone cannot consume and explain this bundle, it has drifted from the book’s core claim.

Failure Mode

A capstone that tries to be a full agent framework will lose focus. A useful OSS project should have a narrow contract, test fixtures, and a measurable claim.

The symptom is scope diffusion. The project begins as evidence tooling, then grows orchestration, chat UI, plugins, hosted integrations, dashboards, and model adapters before the first external user can validate a trace. The broader it becomes, the harder it is to explain why it exists.

The root cause is confusing platform ambition with first-release value. The book’s reusable contribution is not another way to call tools. It is a disciplined evidence contract around agent runs. A narrow agentprobe release can load traces, evaluate schema completeness, summarize policy and context risks, and generate a report. That is enough to be useful and testable.

The artifact that exposes the failure is the first issue list. If the first seven issues are mostly infrastructure expansion and branding, the project is drifting. If they are trace loading, schema validation, fixture support, report generation, and documented limits, the project is preserving the book’s thesis.

Production Translation

The reusable value is not “another agent.” The reusable value is evidence around an agent run: what happened, whether it passed checks, and whether deployment gates should block.
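
The "should the gate block" decision can be a pure function of the report. In this sketch the report keys are hypothetical, and treating context warnings as advisory rather than blocking is an assumption, not a recommendation from the lab.

```python
# Hypothetical report keys whose non-empty values should block a deployment gate.
BLOCKING_KEYS = ("missing_evidence", "failed_evals", "policy_violations")


def evaluate_gate(report):
    """Return (exit_code, reasons): 0 lets the run proceed, 1 blocks it."""
    reasons = [key for key in BLOCKING_KEYS if report.get(key)]
    return (1 if reasons else 0), reasons


clean = {
    "missing_evidence": [],
    "failed_evals": [],
    "policy_violations": [],
    "context_warnings": ["context size tripled during the run"],  # advisory only
}
blocked = dict(clean, failed_evals=["buggy_calc"])

print(evaluate_gate(clean))
print(evaluate_gate(blocked))
```

Returning the reasons with the exit code lets a CI job fail and still say which evidence category caused the block.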

The capstone also creates an article path. A strong article does not need to claim that agent frameworks are bad or that one tool is best. It can make a narrower claim: agent demos become more useful when every run produces trace, eval, policy, and context evidence.

For production users, the capstone should be useful even when they keep their existing agent framework. That requirement constrains the design. Accept trace artifacts. Validate schemas. Surface missing evidence. Generate a report. Avoid forcing orchestration migration. The first successful user story is not “rewrite your agent”; it is “bring one run artifact and leave with a reviewable evidence report.”

Design Review Questions

For the capstone project, ask:

  • What is the smallest useful contract?
  • What input artifacts are supported?
  • What output artifacts are generated?
  • What failure cases are included as fixtures?
  • What claim can the first release prove?
  • What is deliberately out of scope?
  • How does the tool improve this repo first?

A focused capstone should make the book’s evidence standard reusable.

Review Rubric

Reject a capstone plan that starts by building a broad framework, dashboard, or orchestration layer before the evidence contract is stable.

Require review when the project is useful only for this repo’s toy traces. The first external contract should be narrow but not repo-locked.

Accept the project direction when it can load a trace, validate schema, summarize eval and policy evidence, generate a report, and document unsupported cases without forcing framework migration.

Implementation Notes

If this repo becomes the seed for agentprobe, resist moving all code immediately. First add clean boundaries: trace loading, eval summarization, report generation, and policy summary. Then extract those modules with tests. Extraction should preserve fixtures so behavior remains reproducible.

Good extraction is boring. It should make the current repo simpler while creating a reusable package.

Extension Path

The first agentprobe extraction should be read-only. Implement trace loading, trace validation, eval summary loading, policy summary loading, context warning loading, and report generation. Do not implement orchestration or model calls in the first release.

Then add one compatibility adapter for an external trace shape. The adapter should map required fields and explicitly report unsupported fields. This gives the project external value while keeping the core evidence contract small.

Worked Scenario: First External User

The first external user of agentprobe might have a trace from a different agent framework. They do not need a new orchestration system. They need to know whether the trace contains enough evidence to review a run.

That suggests a narrow onboarding path: map their trace into the schema, define one eval, generate one report. If that path works, the project has value. If it requires adopting a whole framework, the project has lost the book’s central discipline.

Chapter Synthesis

The capstone turns the book’s evidence pattern into a possible public tool. The goal is not to build a grand agent framework. The goal is to make agent runs easier to inspect, validate, summarize, and discuss across projects.

The strongest first release would be narrow: load traces, validate schemas, summarize evals and policy, report context risk, and generate a review artifact. That narrowness is not lack of ambition. It is how the project preserves the book’s core discipline.

Evidence and References

The project framing is author interpretation. The technical pieces are repo evidence. The tracing and risk-management references are the OpenAI Agents SDK tracing documentation, the OpenTelemetry semantic conventions for generative AI systems, and the NIST AI Risk Management Framework (OpenAI Agents SDK 2025; OpenTelemetry 2025; National Institute of Standards and Technology 2023).

Takeaways

  • The reusable capstone value is evidence infrastructure, not another broad framework.
  • Start with trace loading, schema validation, and report generation.
  • External users should gain value without migrating orchestration.

Exercises

  1. Score all four projects. Use criteria for evidence value, implementation risk, reader benefit, reusability, and ability to ship a narrow first release.
  2. Pick one first issue. Write the acceptance criteria, test plan, artifact output, and explicit non-goals.
  3. Write seven GitHub issues for agentprobe. At least three should be about tests or schemas rather than UI or demos.
  4. Draft a technical article title and abstract. Make a narrow evidence-backed claim rather than a broad claim about agents.
  5. Define the first public CLI contract for agentprobe. Include inputs, outputs, exit codes, and deterministic sample fixtures.
  6. Map this repo’s trace, eval, policy, context, and report modules to extraction boundaries. Identify one boundary that should not be extracted yet.
  7. Design a compatibility layer for traces from a different agent framework. State the required fields and unsupported cases.
  8. Write a release checklist for version 0.1.0: docs, fixtures, tests, sample reports, provenance of references, and known limitations.

Checklist

  • Pick a narrow OSS contract.
  • Make the first release measurable.
  • Prefer reusable evidence infrastructure over a broad demo.
  • Preserve deterministic fixtures during extraction.
  • Start with trace and report interoperability before orchestration.
  • Public claims should be supported by local artifacts or primary sources.
  • The capstone should improve this repo before becoming a separate project.
  • First external users should succeed without adopting a new agent framework.

References

National Institute of Standards and Technology. 2023. Artificial Intelligence Risk Management Framework (AI RMF 1.0). https://doi.org/10.6028/NIST.AI.100-1.
OpenAI Agents SDK. 2025. Tracing. https://openai.github.io/openai-agents-python/tracing/.
OpenTelemetry. 2025. Semantic Conventions for Generative AI Systems. https://opentelemetry.io/docs/specs/semconv/gen-ai/.