5 Building the Repo Triage Agent

Learning Objective

After this chapter, you should be able to explain the deterministic agent-shaped runtime and its structured output contract.

Why This Matters

The first agent in this book is intentionally deterministic. That keeps the runtime inspectable before model variance enters the picture. The goal is not to simulate intelligence; the goal is to expose the boundaries an LLM-backed version would inherit.

Core Concept

The runtime controls:

run start,
policy check,
file listing,
bounded file reads,
context observations,
deterministic diagnosis,
final structured output,
run finish.

The output schema is:

{
  "summary": "...",
  "files_inspected": ["..."],
  "findings": [{"file": "...", "issue": "...", "evidence": "...", "confidence": 0.92}],
  "recommended_next_step": "..."
}
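
The schema above can be checked mechanically. The following is a minimal sketch, not the repo's validator: the function names, the error messages, and the assumption that confidence lies in the range 0 to 1 are all illustrative.

```python
from typing import Any

# Field names come from the schema above; the type choices are assumptions.
TOP_LEVEL_FIELDS = {
    "summary": str,
    "files_inspected": list,
    "findings": list,
    "recommended_next_step": str,
}
FINDING_TEXT_FIELDS = ("file", "issue", "evidence")


def _is_number(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _validate_finding(index: int, finding: Any) -> list[str]:
    if not isinstance(finding, dict):
        return [f"findings[{index}] is not an object"]
    errors = []
    for key in FINDING_TEXT_FIELDS:
        if key not in finding:
            errors.append(f"findings[{index}] missing field: {key}")
        elif not isinstance(finding[key], str):
            errors.append(f"findings[{index}] wrong type for field: {key}")
    if "confidence" not in finding:
        errors.append(f"findings[{index}] missing field: confidence")
    elif not _is_number(finding["confidence"]):
        errors.append(f"findings[{index}] wrong type for field: confidence")
    elif not 0.0 <= finding["confidence"] <= 1.0:
        # Assumed range; the chapter does not state one.
        errors.append(f"findings[{index}] confidence outside [0, 1]")
    return errors


def validate_result(result: Any) -> list[str]:
    """Return schema problems; an empty list means the result conforms."""
    if not isinstance(result, dict):
        return ["result is not a JSON object"]
    errors = []
    for key, expected in TOP_LEVEL_FIELDS.items():
        if key not in result:
            errors.append(f"missing field: {key}")
        elif not isinstance(result[key], expected):
            errors.append(f"wrong type for field: {key}")
    files = result.get("files_inspected")
    if isinstance(files, list) and not all(isinstance(f, str) for f in files):
        errors.append("files_inspected must contain only strings")
    findings = result.get("findings")
    if isinstance(findings, list):
        for index, finding in enumerate(findings):
            errors.extend(_validate_finding(index, finding))
    return errors
```

Returning a list of problems rather than raising on the first one keeps every violation visible in a single eval report.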

Runtime Walkthrough

The deterministic agent runs in six stages, which group the eight controls listed above.

First, it records agent_start. This creates a run boundary. Second, it records the active ToolPolicy, which tells a reviewer what the run was allowed to do. Third, it lists files and reads selected files through traced tool calls. Fourth, it records context observations with estimated token counts. Fifth, it applies deterministic diagnosis rules. Sixth, it records agent_finish and returns structured JSON.

Nothing in that sequence requires an LLM. That is the point. The runtime contract can be inspected independently from model behavior. A later model-backed implementation can replace the deterministic diagnosis stage, but it should not remove policy, tracing, evals, or output structure.
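
The six stages can be sketched with an in-memory repo and an in-memory trace. This is an illustration of the ordering only, under stated assumptions: run_triage_sketch, the policy fields, every event name other than agent_start and agent_finish, and the four-characters-per-token estimate are hypothetical and are not taken from the repo.

```python
from typing import Callable

# A diagnosis function maps inspected file contents to a list of findings.
Diagnose = Callable[[dict[str, str]], list[dict]]


def run_triage_sketch(
    repo: dict[str, str],
    allowed_suffixes: tuple[str, ...],
    max_chars: int,
    diagnose: Diagnose,
) -> tuple[dict, list[dict]]:
    trace: list[dict] = []

    # Stage 1: open the run boundary.
    trace.append({"event": "agent_start"})

    # Stage 2: record what this run is allowed to do.
    trace.append({
        "event": "policy",
        "allowed_suffixes": list(allowed_suffixes),
        "max_chars": max_chars,
    })

    # Stage 3: list files, then perform bounded reads, each as a traced call.
    visible = sorted(p for p in repo if p.endswith(allowed_suffixes))
    trace.append({"event": "tool_call", "tool": "list_files", "count": len(visible)})
    contents: dict[str, str] = {}
    for path in visible:
        contents[path] = repo[path][:max_chars]
        trace.append({"event": "tool_call", "tool": "read_file", "path": path})

    # Stage 4: record context observations with a rough token estimate.
    for path, text in contents.items():
        trace.append({
            "event": "context_observation",
            "path": path,
            "estimated_tokens": len(text) // 4,  # crude heuristic, an assumption
        })

    # Stage 5: diagnosis is the only replaceable step.
    findings = diagnose(contents)

    # Stage 6: close the run and return structured output.
    result = {
        "summary": f"{len(findings)} finding(s) across {len(contents)} file(s)",
        "files_inspected": list(contents),
        "findings": findings,
        "recommended_next_step": "review findings" if findings else "no action",
    }
    trace.append({"event": "agent_finish", "finding_count": len(findings)})
    return result, trace
```

Because diagnose is passed in, swapping the rule for a model call changes one argument and leaves the trace order untouched.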

Why Structured Output Comes Early

Structured output is not just for downstream convenience. It defines what evals can check. If the agent returns a paragraph, the eval has to parse prose or rely on a judge. If the agent returns files_inspected, the eval can check grounding. If it returns findings[*].file, the eval can count hallucinated files. If it returns confidence, the runtime can later define thresholds.
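
A short sketch shows how the structured fields turn into checks. The function name, the report keys, and the file name test_calculator.py in the usage note are hypothetical; the repo's eval runner may compute these differently.

```python
def grounding_checks(
    result: dict,
    repo_files: set[str],
    expected_files: set[str],
) -> dict:
    """Compare the agent's claims against what exists and what was expected."""
    inspected = set(result["files_inspected"])
    cited = [finding["file"] for finding in result["findings"]]
    return {
        # Grounding: files the eval required that were never inspected.
        "missing_expected_files": sorted(expected_files - inspected),
        # Hallucination: findings that cite a file the repo does not contain.
        "hallucinated_file_count": sum(1 for p in cited if p not in repo_files),
        # Weak support: real files cited in findings but never read.
        "cited_but_not_inspected": sorted(
            {p for p in cited if p in repo_files and p not in inspected}
        ),
    }
```

For a result that inspects only calculator.py but cites both calculator.py and a nonexistent security.py, in a repo that also contains an expected test_calculator.py, the report lists one missing expected file and one hallucinated file. None of this requires parsing prose.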

The schema in this repo is small, but it encodes important questions:

What is the summary?
Which files were inspected?
What finding was made?
What evidence supports it?
How confident is the system?
What should happen next?

Those fields are enough for deterministic tests and a production report.

From Rules to Models

The deterministic diagnosis rule is intentionally replaceable. A model-backed version could decide which files to inspect and how to summarize findings. The runtime should still validate the same output schema and run the same evals.

That replacement boundary is useful. If the LLM version fails an eval that the deterministic version passes, and the runtime layers are unchanged, the failure is unlikely to be in tools, policy, or report generation. It is more likely in model selection, prompt design, file selection, or output control. Good architecture narrows the search space for failures.

Case Study Step

On buggy_calc, the agent reads the same evidence surfaced by the workflow and emits the diagnosis as structured JSON. That is the first moment the system becomes useful as an artifact producer rather than only an inspector. The output can be fed into evals, reports, or future pull-request comments.

The case study remains deterministic because the book is still teaching runtime shape. Once the runtime is solid, a model can be introduced as one replaceable decision component.

Failure Result Design

The current schema assumes the agent returns at least one finding. A production version should also define failure results. Examples:

no files visible under policy,
expected files missing,
output cap prevented diagnosis,
policy violation blocked required evidence,
diagnosis confidence below threshold,
schema validation failed.

Those conditions should not be collapsed into vague prose. They should be structured because they drive different actions. Missing files may indicate the wrong repository path. Output caps may indicate the need for summarization. Low confidence may require human review. Schema failure may indicate a prompt or model issue.

Adding explicit failure results is one of the easiest ways to make an agentic runtime more supportable.
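
One way to make those conditions structured is an enumerated category plus a small result record. The sketch below is an assumption about shape, not the repo's schema: the class names, the status field, and the default next steps are illustrative, and the record fields follow the failure-result exercise at the end of the chapter.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureCategory(str, Enum):
    NO_FILES_VISIBLE = "no_files_visible"
    EXPECTED_FILES_MISSING = "expected_files_missing"
    OUTPUT_CAP_REACHED = "output_cap_reached"
    POLICY_VIOLATION = "policy_violation"
    LOW_CONFIDENCE = "low_confidence"
    SCHEMA_VALIDATION_FAILED = "schema_validation_failed"


# Default actions for the four categories the text maps to an action.
# The wording is illustrative.
SUGGESTED_NEXT_STEP = {
    FailureCategory.EXPECTED_FILES_MISSING: "check the repository path",
    FailureCategory.OUTPUT_CAP_REACHED: "summarize or narrow the reads",
    FailureCategory.LOW_CONFIDENCE: "route to human review",
    FailureCategory.SCHEMA_VALIDATION_FAILED: "inspect the prompt or model output",
}


@dataclass(frozen=True)
class FailureResult:
    category: FailureCategory
    summary: str
    trace_path: str
    recommended_next_step: str

    def to_json_dict(self) -> dict:
        return {
            "status": "failed",
            "category": self.category.value,
            "summary": self.summary,
            "trace_path": self.trace_path,
            "recommended_next_step": self.recommended_next_step,
        }


def make_failure(
    category: FailureCategory,
    summary: str,
    trace_path: str,
    next_step: Optional[str] = None,
) -> FailureResult:
    # Fallback text for unmapped categories is an assumption.
    step = next_step or SUGGESTED_NEXT_STEP.get(category, "inspect the trace")
    return FailureResult(category, summary, trace_path, step)
```

Keeping the category as an enum means a report or alert can branch on it without string matching against a summary sentence.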

Confidence and Calibration

The deterministic agent returns a confidence value, but the value is a rule confidence, not a calibrated probability. That distinction matters. A model-backed agent may output a confidence value that feels precise but has not been calibrated against observed outcomes.

For this lab, confidence is used only as a structured field. It should not be interpreted as a reliable probability. A production version would need calibration data or would need to replace confidence with more concrete status fields such as requires_human_review, evidence_complete, or diagnosis_blocked.
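
The status fields named above could be derived along these lines. This is a sketch under assumptions: the function name, the blocked_reason parameter, the 0.8 review threshold, and the rule that any of the three conditions triggers review are all hypothetical choices, not repo behavior.

```python
from typing import Optional


def status_fields(
    findings: list[dict],
    files_inspected: list[str],
    required_files: list[str],
    blocked_reason: Optional[str] = None,
    review_threshold: float = 0.8,  # arbitrary cut-off, not a calibrated value
) -> dict:
    # Evidence is complete when every required file was actually inspected.
    evidence_complete = set(required_files) <= set(files_inspected)
    # The caller reports a block (policy, output cap, missing files) explicitly.
    diagnosis_blocked = blocked_reason is not None
    # Confidence is used only as a coarse trigger, never as a probability.
    low_confidence = any(f["confidence"] < review_threshold for f in findings)
    return {
        "evidence_complete": evidence_complete,
        "diagnosis_blocked": diagnosis_blocked,
        "requires_human_review": (
            diagnosis_blocked or low_confidence or not evidence_complete
        ),
    }
```

The booleans state what the runtime will do next, which is easier to test and review than a number whose meaning depends on the strategy that produced it.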

Implementation Review

When reviewing the agent implementation, separate runtime code from diagnosis code. Runtime code should enforce policy, trace tools, and validate output. Diagnosis code can be deterministic rules, model calls, or a hybrid. Mixing those layers makes it harder to replace model behavior without weakening the runtime.

The current implementation is small enough that the separation is conceptual rather than architectural. In a larger system, you might make the diagnosis step an interface. The deterministic implementation becomes one strategy, the LLM-backed implementation another, and tests can run both against the same fixtures.
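
In Python, that interface could be a typing.Protocol. The sketch below assumes hypothetical names throughout (DiagnosisStrategy, RuleStrategy, StubStrategy, satisfies_finding_shape); the repo does not currently define them. The 0.92 value is the placeholder from the schema example and remains a rule confidence, not a probability.

```python
from typing import Protocol

FINDING_KEYS = {"file", "issue", "evidence", "confidence"}


class DiagnosisStrategy(Protocol):
    name: str

    def diagnose(self, contents: dict[str, str]) -> list[dict]:
        ...


class RuleStrategy:
    """Deterministic strategy: flag lines containing a known substring."""

    name = "deterministic_rules"
    RULE_CONFIDENCE = 0.92  # placeholder rule confidence

    def __init__(self, rules: list[tuple[str, str]]) -> None:
        self._rules = rules  # pairs of (substring to match, issue label)

    def diagnose(self, contents: dict[str, str]) -> list[dict]:
        findings = []
        for path, text in sorted(contents.items()):
            for line in text.splitlines():
                for needle, issue in self._rules:
                    if needle in line:
                        findings.append({
                            "file": path,
                            "issue": issue,
                            "evidence": line.strip(),
                            "confidence": self.RULE_CONFIDENCE,
                        })
        return findings


class StubStrategy:
    """Stand-in for a model-backed strategy: returns a canned result."""

    name = "stub"

    def __init__(self, canned: list[dict]) -> None:
        self._canned = canned

    def diagnose(self, contents: dict[str, str]) -> list[dict]:
        return list(self._canned)


def satisfies_finding_shape(
    strategy: DiagnosisStrategy, contents: dict[str, str]
) -> bool:
    """True when every finding carries exactly the schema's finding fields."""
    return all(set(f) == FINDING_KEYS for f in strategy.diagnose(contents))
```

Both classes satisfy the protocol structurally, so the same fixture-driven test can run against either without inheritance or registration.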

Staff Practice Notes

The first useful agent in a repo should be boring. It should inspect a small fixture, produce a typed result, write a trace, pass evals, and avoid mutation. That may feel underpowered, but it creates the harness for every serious model experiment that follows.

When replacing deterministic diagnosis with a model, resist rewriting the runtime around the model’s preferred shape. Keep the model behind an adapter. Validate output. Preserve trace and eval semantics. If the model is good, it will survive those constraints. If it does not, the failure is useful evidence.

Operational Invariants

The agent output schema should not depend on the diagnosis strategy. A deterministic rule, hosted model, local model, or human-in-the-loop strategy can vary internally, but the runtime should still emit the same structured fields. This lets evals and reports remain stable while experimentation happens behind a narrow boundary.

The agent should record enough evidence to explain both success and failure. A successful diagnosis should show inspected files and evidence. A failure should show whether the issue was policy, missing files, malformed output, low confidence, or context budget. Silent fallback from one failure type to another makes incident review harder.

The agent should not mutate the target repo in the core path. Read-only behavior keeps the lab reproducible and keeps the first contract narrow. If mutation is added later, it should arrive through proposed patches, approvals, post-action verification, and rollback artifacts.
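
The read-only invariant can be asserted rather than assumed. The following sketch hashes every file under a fixture directory before and after a run; snapshot and run_without_mutation are hypothetical helper names, and the run callable stands in for whatever entry point the agent exposes.

```python
import hashlib
from pathlib import Path
from typing import Callable


def snapshot(root: Path) -> dict[str, str]:
    """Map each file's relative path to the SHA-256 digest of its bytes."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def run_without_mutation(root: Path, run: Callable[[Path], dict]) -> dict:
    """Execute run(root) and fail if any file was added, removed, or changed."""
    before = snapshot(root)
    result = run(root)
    after = snapshot(root)
    if before != after:
        changed = sorted(
            path
            for path in before.keys() | after.keys()
            if before.get(path) != after.get(path)
        )
        raise AssertionError(f"agent run mutated the fixture: {changed}")
    return result
```

Comparing content hashes instead of modification times catches rewrites that preserve timestamps and ignores harmless metadata changes.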

The Lab

python -m agentic_systems_lab.agent

Reading the Lab Output

The output should be read as a contract instance. summary is a short diagnosis. files_inspected is a grounding claim. findings is the evidence-bearing list. recommended_next_step is the proposed action. Each field can be tested or reviewed.

The result is intentionally not a conversational paragraph. A paragraph may be nicer to read, but structured JSON gives the eval runner and report generator something concrete to inspect.

Check the evidence field with a skeptical eye. It should be specific enough to explain the finding, but not so large that it becomes a hidden context dump. In production, evidence should help a human verify the result quickly and help an eval distinguish grounded output from plausible narration.

Code Walkthrough

run_repo_triage creates a ToolPolicy, then uses traced tool calls for list_files and read_file. Diagnosis is rule-based: buggy_calc is flagged when the implementation returns a / b while the tests or documentation describe different zero-division behavior.

The orchestration order is the key design. The run starts a trace before tool calls. The active policy is recorded before evidence is gathered. File listing happens before file reads. File reads become context observations. Diagnosis happens after evidence exists. The final result is structured and traceable.

The diagnosis rule is intentionally replaceable. It is not meant to be the clever part of the system. A future LLM strategy could inspect the same evidence and emit the same schema. The rest of the runtime should not care whether the finding came from deterministic pattern matching, a hosted model, a local model, or a human reviewer.

That is the useful boundary for experimentation. If a new strategy improves findings while preserving schema, trace, policy, eval, and report behavior, the experiment is easy to review. If it requires changing every artifact shape, the model experiment is probably coupled too deeply into the runtime.

Expected Output

The result should identify calculator.py and report a division-by-zero contract mismatch. A trace is written under traces/.

The expected finding should include evidence, not only a file name. A useful diagnosis tells the reader why calculator.py matters and how the surrounding tests or documentation create the mismatch. A file-only result is weaker because it forces the reviewer to rediscover the argument.

Failure Mode

An LLM-backed version could inspect irrelevant files, hallucinate filenames, ignore evidence, over-read context, or emit malformed JSON. These are not claims that every LLM behaves this way; they are failure dimensions this repo can evaluate.

The symptom is output that looks agentic but breaks the contract. The diagnosis may be plausible while files_inspected omits the decisive test file. The result may include a filename that does not exist. The answer may explain the bug in prose but fail the JSON schema. The run may read every file because selection logic is weak, turning a small task into unnecessary context growth.

The root cause is putting model behavior before runtime shape. The agent contract should exist before the model strategy changes: schema, allowed tools, trace events, eval tasks, policy, and report status. Once those artifacts are fixed, a model-backed strategy can be evaluated as a strategy. Without them, every model failure becomes a bespoke debugging session.

The artifact that exposes the failure is the combination of structured output and eval result. Schema checks catch malformed answers. Expected-file checks catch weak grounding. Hallucinated-file checks catch unsupported claims. Context warnings catch over-reading. The deterministic agent is intentionally simple because it gives all of those checks a stable target.
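
Of those checks, the context warning is the least obvious to express in code. A minimal sketch follows, assuming each context observation is a dict with path and estimated_tokens keys; that shape, the function name, and both warning rules are hypothetical.

```python
def context_warnings(
    observations: list[dict],
    visible_file_count: int,
    token_budget: int,
) -> list[str]:
    """Return human-readable warnings about context growth for one run."""
    warnings = []

    # Rule 1: total estimated tokens exceed the budget for the task.
    total = sum(obs["estimated_tokens"] for obs in observations)
    if total > token_budget:
        warnings.append(f"estimated tokens {total} exceed budget {token_budget}")

    # Rule 2: the run read every visible file, which suggests no selection.
    read_paths = {obs["path"] for obs in observations}
    if visible_file_count > 1 and len(read_paths) == visible_file_count:
        warnings.append("every visible file was read; file selection may be weak")

    return warnings
```

Neither rule judges the diagnosis itself. They flag the run shape, so a correct answer reached by reading everything still shows up in review.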

Production Translation

Before deploying an LLM version, the deterministic skeleton should already have tools, policy, tracing, evals, context accounting, and report generation. Model quality is not a substitute for runtime evidence.

The production version would need more layers: timeout handling, retries, structured validation, model refusal handling, policy violation events, artifact retention, and human review for uncertain diagnoses. But those layers should extend the skeleton rather than replace it.

One useful design review question is: if the model decision were wrong, would the system have enough evidence to show why? In this lab, the trace shows files read, policy, context observations, and finish status. The eval shows whether required evidence appeared. That is the minimum shape of a debuggable run.

A production rollout should separate advisory and mutating modes. In advisory mode, the agent can inspect, diagnose, and generate a report for a human. In mutating mode, it can post comments, open issues, or propose patches. Those modes should not share the same gate. Advisory mode may require schema, trace, and eval evidence. Mutating mode additionally needs approval policy, idempotency, rollback, rate limits, and user-facing error handling.
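
The two gates can be written down as data. In the sketch below, the requirement labels paraphrase the paragraph above and the function name is hypothetical; how each piece of evidence is actually established is out of scope.

```python
# Evidence labels are illustrative names for the requirements in the text.
ADVISORY_GATE = frozenset({"schema_valid", "trace_written", "evals_passed"})
MUTATING_GATE = ADVISORY_GATE | {
    "approval_policy",
    "idempotency",
    "rollback",
    "rate_limits",
    "error_handling",
}
GATES = {"advisory": ADVISORY_GATE, "mutating": MUTATING_GATE}


def gate_decision(mode: str, evidence: set[str]) -> tuple[bool, list[str]]:
    """Return (allowed, missing requirements) for the requested mode."""
    if mode not in GATES:
        raise ValueError(f"unknown mode: {mode}")
    missing = sorted(GATES[mode] - evidence)
    return (not missing, missing)
```

Defining the mutating gate as a superset of the advisory gate makes the relationship explicit: a run that cannot pass advisory review can never be allowed to mutate.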

Design Review Questions

For the agent runtime, ask:

What starts the run?
What policy is active?
Which tools can be called?
What selects files or evidence?
What output schema is required?
What happens on malformed output?
What happens on low confidence?
What trace fields prove the run path?
What evals gate the result?

The diagnosis step can evolve from rules to models, but these runtime questions should remain stable.

Review Rubric

Reject an agent runtime that returns prose only, mutates the repository in the default path, or hides tool selection from traces.

Require review when output is structured but confidence, failure categories, or low-evidence behavior are undefined. A schema without failure semantics is only half a contract.

Accept the runtime when strategy choice is isolated, output schema is stable, policy is recorded, tool calls are traced, evals check evidence, and no mutation happens without an explicit feature and gate.

Implementation Notes

The next implementation step would be to isolate diagnosis behind a function or strategy object. The deterministic strategy would remain the default. An LLM strategy could be optional and would have to return the same schema. Unit tests would run against the deterministic strategy; integration tests could exercise the LLM strategy only when credentials are explicitly available.

This keeps the core repo deterministic while leaving a clean extension point.

Extension Path

The next implementation slice should isolate diagnosis behind a strategy protocol. Keep the deterministic strategy as default, add a stub strategy that returns a known result, and prove both satisfy the same schema. This creates a testable seam for later model experiments.

Then add strategy metadata to traces and reports. A reviewer should know whether a finding came from deterministic rules, a local model, a hosted model, or a human reviewer. Strategy identity is not just debugging detail; it determines which evals and rollout gates apply.

Worked Scenario: Replacing the Diagnosis Rule

Assume you replace _diagnose with an LLM call. The rest of the runtime should barely change. The agent still starts a trace. It still records policy. It still calls read-only tools. It still records context observations. It still validates the output schema. It still runs evals.

If the LLM version inspects more files and finds a better issue, the trace and evals should show that improvement. If it hallucinates security.py, the hallucinated-file count should catch it. If it emits prose instead of JSON, schema validation should catch it. If it tries to follow instructions from prompt_injection_repo, policy should limit the blast radius.
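
The prose-instead-of-JSON case is the simplest one to sketch. The adapter below takes raw model text and returns either a parsed result or a structured failure; the function name, the status and category values, and the detail messages are assumptions, and only top-level field presence is checked.

```python
import json

REQUIRED_FIELDS = {"summary", "files_inspected", "findings", "recommended_next_step"}


def _schema_failure(detail: str) -> dict:
    return {
        "status": "failed",
        "category": "schema_validation_failed",
        "detail": detail,
    }


def parse_model_output(raw: str) -> dict:
    """Turn raw model text into a parsed result or a structured failure."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Prose, truncated JSON, and similar output all land here.
        return _schema_failure(f"not valid JSON: {exc.msg}")
    if not isinstance(parsed, dict):
        return _schema_failure("top-level value is not an object")
    missing = sorted(REQUIRED_FIELDS - set(parsed))
    if missing:
        return _schema_failure(f"missing fields: {missing}")
    return {"status": "ok", "result": parsed}
```

The adapter never raises on bad model output. A malformed answer becomes a recorded failure with a category, so the trace and eval still have something concrete to inspect.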

That is how deterministic scaffolding makes model experimentation safer. It gives the experiment a harness.

Chapter Synthesis

The repo triage agent is deterministic because the runtime contract matters more than model novelty at this stage. It shows that an agent-shaped system can have tools, policy, trace evidence, context observations, structured output, and eval hooks before any external model is introduced.

That is the right order for serious experimentation. Once the harness exists, a model-backed diagnosis strategy can be evaluated inside it. Without the harness, every model result has to be judged manually, and every failure becomes ambiguous.

Evidence and References

The agent architecture described in this chapter is evidenced by the repo itself. The broader idea that agent SDKs can coordinate tools, guardrails, handoffs, and traces is supported by OpenAI’s Agents SDK docs (OpenAI 2025).

Takeaways

The deterministic agent is a harness for later model experiments.
Strategy internals can vary only if schema, policy, trace, and eval contracts stay stable.
Read-only behavior keeps the first agent reproducible and reviewable.

Exercises

Add a second deterministic diagnosis rule. Start with a failing test that names the expected file, issue, evidence phrase, and confidence.
Add a fixture repo for a different bug class. Keep the fixture small enough that every relevant file can be inspected by a reviewer.
Add a confidence threshold. Define what happens when all findings fall below it and how the report should render that status.
Add a failure result schema. Include failure category, user-visible summary, trace path, and recommended next diagnostic step.
Refactor diagnosis behind a strategy interface. Verify that the deterministic strategy and a stub model strategy return the same output schema.
Add a no-mutation assertion for an agent run. The test should compare fixture file contents before and after execution.
Extend trace creation to record which strategy produced the diagnosis. Keep the field deterministic in committed sample artifacts.
Write the integration test you would require before allowing an optional LLM diagnosis strategy into CI.

Checklist

The first agent is deterministic on purpose.
Structured output is part of the runtime contract.
Trace and eval behavior are not optional add-ons.
Diagnosis strategy can vary; artifacts should not.
Fixture repos should encode concrete failure modes.
Low confidence needs an explicit runtime outcome.
Agent runs should be non-mutating unless mutation is the feature.
Optional model paths must not weaken deterministic tests.

OpenAI. 2025. Agents SDK. https://developers.openai.com/api/docs/guides/agents.