7 Evaluating Agents

Learning Objective

After this chapter, you should be able to define deterministic eval checks for process correctness and task success.

Why This Matters

A demo can succeed once and still be unfit for deployment. The repo’s evals intentionally avoid LLM-as-judge in the core path. They check observable properties: schema validity, expected file inspection, expected keywords, hallucinated file references, and invalid tool calls.

Core Concept

Evaluation should cover both final output and process:

Eval Dimension	Local Example
Schema validity	final JSON has required keys
Evidence	expected file inspected
Keyword match	“division” or “prompt-injection” appears
Grounding	inspected files are allowed files
Tool validity	policy violations counted
Context budget	large dynamic outputs flagged by reports

This is a local design, not a universal eval framework. The value is that every check is falsifiable.

Process Correctness

Many evals focus only on the final answer. Agentic systems need process checks because the path is part of the behavior. If a repo triage system identifies the right bug while citing a file it never inspected, the output may be lucky rather than grounded. If it flags prompt injection but also attempts a blocked path traversal, the final answer may look safe while the runtime behavior is unsafe.

The local eval suite uses small checks because they are easy to trust:

schema_ok says the result can be consumed downstream.
expected_file_inspected says the agent looked at required evidence.
expected_keyword_present says the diagnosis contains the expected concept.
hallucinated_file_count counts inspected-file claims that are not grounded in the allowed set.
invalid_tool_call_count counts the policy violations that occurred during the run.

These checks do not prove general correctness. They prove that a narrow set of observable properties held for the fixture.
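The two scenarios above (a correct diagnosis that cites an uninspected file, and a safe-looking answer that hides a blocked path traversal) can both be caught from a tool-call trace. The following is a minimal sketch, not the repo's implementation: the ToolCall record, the read_file tool name, and the file paths other than test_calculator.py are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One recorded tool invocation (hypothetical trace format)."""
    tool: str
    path: str
    allowed: bool  # False when the policy layer blocked the call


def uninspected_citations(cited_files, trace):
    """Return cited files that no allowed read_file call ever touched."""
    inspected = {c.path for c in trace if c.tool == "read_file" and c.allowed}
    return sorted(set(cited_files) - inspected)


def blocked_call_count(trace):
    """Count tool calls that the policy layer rejected."""
    return sum(1 for c in trace if not c.allowed)


# Hypothetical run: one legitimate read, one blocked path traversal.
trace = [
    ToolCall("read_file", "calculator.py", allowed=True),
    ToolCall("read_file", "../secrets.txt", allowed=False),
]
# The final answer cites a test file that the trace never opened.
cited = ["calculator.py", "test_calculator.py"]

lucky = uninspected_citations(cited, trace)
blocked = blocked_call_count(trace)
print("cited but never inspected:", lucky)
print("blocked tool calls:", blocked)
```

Neither check reads the final answer's prose. Both look only at what the run did, which is why they stay deterministic.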

Eval Dataset Shape

The three current fixtures each exercise a different dimension:

buggy_calc: task diagnosis and evidence.
prompt_injection_repo: untrusted content and policy confinement.
noisy_logs_repo: context bloat and report warnings.

A production eval set should include happy paths, known regressions, adversarial fixtures, large-context cases, and malformed-output cases. It should also preserve failures. A failing fixture is often more valuable than a polished demo because it tells the team what the gate is supposed to catch.
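One way to keep that coverage goal honest is to tag each fixture with a category and report which categories have no fixture yet. The sketch below assumes a category label for each of the three local fixtures; those labels and the category names are illustrative choices, not fields in the repo.

```python
# Categories a production eval set should cover, per the paragraph above.
REQUIRED_CATEGORIES = {
    "happy_path",
    "known_regression",
    "adversarial",
    "large_context",
    "malformed_output",
}

# Assumed labels for the three local fixtures (not taken from the repo).
FIXTURES = {
    "buggy_calc": "happy_path",
    "prompt_injection_repo": "adversarial",
    "noisy_logs_repo": "large_context",
}


def missing_categories(fixtures, required=REQUIRED_CATEGORIES):
    """Return required categories that no fixture currently exercises."""
    covered = set(fixtures.values())
    return sorted(required - covered)


gaps = missing_categories(FIXTURES)
print("uncovered categories:", gaps)
```

Under these assumed labels, the gaps are the known-regression and malformed-output categories, which is a concrete to-do list rather than a vague sense that the suite is small.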

False Confidence

Small eval suites are useful but dangerous if overinterpreted. Three fixtures cannot prove broad readiness. They can prove that three known contracts still hold. That is enough to catch regressions and teach evaluation structure, but not enough to claim general reliability.

The evidence policy matters here. Say “the default local eval suite passes” rather than “the agent is reliable.” The second claim is broader than the evidence.

Case Study Step

The three toy repos form the first regression set. buggy_calc protects the diagnosis path. prompt_injection_repo protects the safety/policy path. noisy_logs_repo protects the context-warning path. A future contributor can add fixtures without changing the book’s central pattern: every new risk should become a runnable case.

Eval Maintenance

Eval suites decay if they are not maintained. Fixtures can stop representing real failures. Checks can become too easy. Reports can remain green while the product changes underneath them. A staff-level owner should treat eval maintenance as product work, not test cleanup.

A useful maintenance loop is:

capture real failures as fixtures,
write deterministic checks where possible,
add model-judge or human-review checks only where deterministic checks are insufficient,
keep old failures unless they are obsolete for a documented reason,
track which evals are release gates.

The local suite is small enough to inspect by hand. That is a feature for a learning repo. In a real system, the same principles apply at larger scale.

What to Do With Failing Evals

A failing eval is not automatically bad news. It is information. The important question is whether the failure is expected, newly introduced, or obsolete. Expected failures may document known gaps. Newly introduced failures may block release. Obsolete failures should be removed only with an explanation.

The report should eventually make this distinction explicit. A production eval system often needs metadata such as owner, severity, fixture source, expected status, and release-blocking flag. The local eval schema is simpler, but it prepares the reader to ask for that metadata.
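A hedged sketch of what that metadata could look like, and how it separates expected, newly introduced, and obsolete failures. EvalMeta, Triage, the obsolete flag, and the owner name are hypothetical; the local eval schema does not define them.

```python
from dataclasses import dataclass, replace
from enum import Enum


class Triage(Enum):
    PASSING = "passing"
    EXPECTED = "expected failure (documented gap)"
    NEW = "newly introduced failure"
    OBSOLETE = "obsolete fixture (remove only with an explanation)"


@dataclass(frozen=True)
class EvalMeta:
    """Hypothetical per-fixture metadata; not part of the local schema."""
    fixture: str
    owner: str
    severity: str
    fixture_source: str
    expected_status: str  # "pass" or "fail"
    release_blocking: bool
    obsolete: bool = False


def triage(meta, passed):
    """Classify one eval outcome using the fixture's metadata."""
    if passed:
        return Triage.PASSING
    if meta.obsolete:
        return Triage.OBSOLETE
    if meta.expected_status == "fail":
        return Triage.EXPECTED
    return Triage.NEW


def blocks_release(meta, passed):
    """Only new failures on release-blocking fixtures stop a release."""
    return meta.release_blocking and triage(meta, passed) is Triage.NEW


meta = EvalMeta(
    fixture="prompt_injection_repo",
    owner="security-team",  # hypothetical owner
    severity="high",
    fixture_source="synthetic",
    expected_status="pass",
    release_blocking=True,
)
known_gap = replace(meta, expected_status="fail")

print(triage(meta, passed=False).value, blocks_release(meta, passed=False))
print(triage(known_gap, passed=False).value, blocks_release(known_gap, passed=False))
```

The same red result leads to two different decisions depending on metadata, which is the distinction a bare pass/fail field cannot express.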

Staff Practice Notes

For agent evals, the first question is not “what score did we get?” It is “what behavior would this eval catch?” A small eval that catches a real grounding regression is more valuable than a high aggregate score that no one knows how to act on. Tie every eval to a failure mode and an owner.

Keep eval failures socially useful. If a failure only says “bad answer,” engineers will debate the model. If it says “expected file not inspected” or “invalid tool call occurred,” engineers know where to look. Good eval design reduces blame and increases repair speed.

Operational Invariants

Every eval should encode a failure mode, not just a desired answer. buggy_calc encodes behavioral mismatch, prompt_injection_repo encodes untrusted-content handling, and noisy_logs_repo encodes context pressure. That mapping is what makes failures actionable.

Every eval should separate answer quality from process quality when the process matters. A correct final sentence can still fail if required evidence was not inspected or if the answer names unsupported files. Process checks are essential for agentic systems because tool use is part of the behavior being deployed.

Every critical eval should have an owner and a failure action. A red test without an owner becomes background noise. A warning without a failure action becomes decoration. Release gates should say which failures block, which require human review, and which are informational.
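The third invariant can be made mechanical by mapping each check to a failure action and taking the most severe action among the failed checks. This is a sketch of one possible policy, not the repo's behavior; the specific assignments below (for example, that a missing keyword is only informational) are assumptions a real team would have to decide.

```python
from enum import Enum


class Action(Enum):
    NONE = 0
    INFORMATIONAL = 1
    HUMAN_REVIEW = 2
    BLOCK = 3


# Hypothetical policy: which action each failed check triggers.
FAILURE_ACTIONS = {
    "schema_ok": Action.BLOCK,
    "hallucinated_file_count": Action.BLOCK,
    "invalid_tool_call_count": Action.BLOCK,
    "expected_file_inspected": Action.HUMAN_REVIEW,
    "expected_keyword_present": Action.INFORMATIONAL,
}


def gate_decision(failed_checks):
    """Return the most severe action among the failed checks.

    A check with no configured action defaults to HUMAN_REVIEW so that an
    unowned failure cannot silently become background noise.
    """
    actions = [FAILURE_ACTIONS.get(name, Action.HUMAN_REVIEW) for name in failed_checks]
    return max(actions, key=lambda a: a.value, default=Action.NONE)


print(gate_decision([]))
print(gate_decision(["expected_keyword_present", "expected_file_inspected"]))
print(gate_decision(["expected_keyword_present", "schema_ok"]))
```

Defaulting unknown checks to review rather than to no action is a deliberate choice here: a new check added without a failure action still reaches a human.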

The Lab

python -m agentic_systems_lab.evals

Reading the Lab Output

The eval command prints a list of task results. Each result is a set of checks, not a single grade. A failure in schema_ok means the output contract broke. A failure in expected_file_inspected means grounding is weak. A nonzero hallucinated_file_count means the result referenced files outside the allowed set.

This diagnostic shape matters. A useful eval should tell the engineer what to fix next.

Read failures as subsystem pointers. A grounding failure does not necessarily mean the model is weak; it may mean file selection skipped the relevant test. A hallucinated-file failure may point to output validation. An invalid-tool-call failure may point to policy or prompt/tool mismatch. Good evals reduce the search space for debugging.
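Reading failures as subsystem pointers can be sketched as a small lookup over one result. The check names come from the local suite; the dictionary shape of a result and the hint wording are assumptions, and the hint for expected_keyword_present in particular is a guess, since the chapter does not name its subsystem.

```python
# Failed check -> where to look first. Hint text is illustrative.
SUBSYSTEM_HINTS = {
    "schema_ok": "output construction",
    "expected_file_inspected": "evidence and file selection",
    "expected_keyword_present": "diagnosis content (assumed)",
    "hallucinated_file_count": "grounding or output validation",
    "invalid_tool_call_count": "policy or prompt/tool mismatch",
}


def failed_checks(result):
    """Return names of failed checks: False booleans or nonzero counts."""
    failed = []
    for name, value in result.items():
        if name not in SUBSYSTEM_HINTS:
            continue  # skip derived fields such as "passed"
        ok = (value == 0) if name.endswith("_count") else bool(value)
        if not ok:
            failed.append(name)
    return failed


def next_steps(result):
    """Turn each failed check into a pointer at the subsystem to inspect."""
    return [f"{name}: inspect {SUBSYSTEM_HINTS[name]}" for name in failed_checks(result)]


# Hypothetical result for one task.
result = {
    "schema_ok": True,
    "expected_file_inspected": False,
    "expected_keyword_present": True,
    "hallucinated_file_count": 2,
    "invalid_tool_call_count": 0,
    "passed": False,
}

for step in next_steps(result):
    print(step)
```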

Code Walkthrough

EvalTask names a fixture repo, expected file, expected keyword, and allowed files. evaluate_result computes booleans and counts, then derives passed. The default suite covers buggy_calc, prompt_injection_repo, and noisy_logs_repo.

The design is intentionally explicit rather than statistical. Each field encodes a contract: the output must have the right schema, the agent must inspect expected evidence, the finding must contain a task-relevant keyword, inspected files must stay within the allowed set, and invalid tool calls must be absent. The pass/fail result is a conjunction of those checks.
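To make the contract concrete, here is an illustrative sketch of the shape described above. The names EvalTask and evaluate_result come from the repo, but the field names, the result dictionary keys, and the function signature below are assumptions; the actual definitions may differ.

```python
from dataclasses import dataclass

# Assumed output contract: keys the final JSON must contain.
REQUIRED_KEYS = ("summary", "inspected_files", "invalid_tool_calls")


@dataclass(frozen=True)
class EvalTask:
    """Illustrative stand-in for the repo's EvalTask (fields assumed)."""
    repo: str
    expected_file: str
    expected_keyword: str
    allowed_files: frozenset


def evaluate_result(task, result):
    """Compute booleans and counts, then derive passed as their conjunction."""
    inspected = list(result.get("inspected_files", []))
    summary = str(result.get("summary", "")).lower()
    checks = {
        "schema_ok": all(key in result for key in REQUIRED_KEYS),
        "expected_file_inspected": task.expected_file in inspected,
        "expected_keyword_present": task.expected_keyword.lower() in summary,
        "hallucinated_file_count": sum(1 for f in inspected if f not in task.allowed_files),
        "invalid_tool_call_count": len(result.get("invalid_tool_calls", [])),
    }
    checks["passed"] = (
        checks["schema_ok"]
        and checks["expected_file_inspected"]
        and checks["expected_keyword_present"]
        and checks["hallucinated_file_count"] == 0
        and checks["invalid_tool_call_count"] == 0
    )
    return checks


task = EvalTask(
    repo="buggy_calc",
    expected_file="test_calculator.py",
    expected_keyword="division",
    allowed_files=frozenset({"calculator.py", "test_calculator.py"}),  # calculator.py is assumed
)
good = evaluate_result(task, {
    "summary": "Division by zero is not handled.",
    "inspected_files": ["calculator.py", "test_calculator.py"],
    "invalid_tool_calls": [],
})
bad = evaluate_result(task, {
    "summary": "Something is wrong with the math.",
    "inspected_files": ["calculator.py", "utils.py"],
    "invalid_tool_calls": [],
})
print(good["passed"], bad["passed"], bad["hallucinated_file_count"])
```

Because passed is a plain conjunction, any single failing field flips the result, and the remaining fields say which contract broke.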

This is not enough for broad model evaluation, but it is exactly the right shape for a regression harness. A failed boolean points to a subsystem. Schema failure points to output construction. Missing expected file points to evidence selection. Hallucinated file count points to grounding. Invalid tool call count points to policy.

The default tasks also demonstrate fixture diversity. One fixture is a normal bug diagnosis, one is adversarial content, and one is context pressure. A small suite with different failure surfaces is more useful than three near-identical bug examples.

Expected Output

The command prints three passing eval results and writes reports/sample_eval_report.md.

The passing status should not be read as broad model quality. It means the deterministic agent satisfied the current fixture contracts. The stronger evidence is the per-check breakdown, because that breakdown explains what would fail if grounding, schema, or policy behavior regressed.

Failure Mode

An agent can produce plausible text without inspecting the right file. It can also cite a file that does not exist. Those failures are easy to miss in a transcript and easy to catch with deterministic checks.

The symptom is a lucky answer. The final summary is correct enough for a quick demo, but the process was weak. The agent may have guessed from a filename, copied stale context, or relied on a generic pattern. If the eval only scores final text, the run passes despite poor evidence selection.

The root cause is evaluating fluency instead of contract. Agent evals should check schema, expected evidence, unsupported file claims, invalid tool calls, and fixture-specific safety behavior. Natural-language quality matters, but it is not the first gate for a system that can call tools and produce artifacts.

The artifact that exposes the failure is a process-aware eval result. expected_file_inspected=false has a different remediation path from expected_keyword_present=false. hallucinated_file_count>0 points to grounding or validation. invalid_tool_call_count>0 points to policy. A useful eval report should tell the engineer which subsystem to inspect next.

Production Translation

Deployment should be gated by eval results. For higher-risk systems, deterministic checks should be complemented by model-judge evals, human review, red-team cases, or online monitoring. That broader recommendation is author judgment; the repo only demonstrates the deterministic layer.

An eval gate should specify what happens on failure. Does deployment block? Does the run require human review? Does a failure open an issue? Does it page an owner? Without an action, an eval is just a report. The local passed field is intentionally simple, but the production question is organizational: who owns the failed gate?

Ownership matters. Every eval should have an owner, fixture source, update policy, and failure action. Otherwise evals decay into stale comfort. When a real incident occurs, convert it into a fixture or explain why it cannot be represented. When a product requirement changes, update expected behavior and preserve the old fixture if it still represents a risk. The eval suite is a living regression surface, not a one-time launch checklist.

Design Review Questions

For evals, ask:

What failure mode does each fixture represent?
Which checks are deterministic?
Which checks require human or model judgment?
Which failures block release?
Which failures require review but do not block release?
Who owns fixture updates?
How are real incidents converted into evals?
How do evals compare agent behavior to a workflow baseline?

If an eval does not connect to a failure action, it is weaker than it looks.

Review Rubric

Reject evals that score only final answer style for a tool-using system. They ignore the process being deployed.

Require review when evals pass but do not cover grounding, hallucinated files, invalid tool calls, or known adversarial fixtures. A suite that passes only narrow checks should earn only narrow rollout authority.

Accept the eval suite when every critical fixture maps to a failure mode, every failure has an owner and action, and report output separates correctness, safety, context, and deployment implications.

Implementation Notes

The next eval improvement is to separate task checks from deployment gates. A task can pass while a deployment gate blocks due to context growth or policy warnings. The current report already demonstrates this distinction informally. A richer schema could expose task_passed, safety_passed, context_passed, and deployment_status.

That split prevents a common confusion: “the answer was correct” does not mean “the run is deployable.”
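A minimal sketch of that richer schema follows. The four field names are the ones proposed above; the gate_result helper, its inputs, and the rule that any policy violation or context warning blocks deployment are assumptions, and a real gate might route some of these to human review instead.

```python
def gate_result(task_passed, policy_violations, context_warnings):
    """Split one run into task, safety, and context status.

    Assumed rule: deployment requires all three to pass. The legacy
    "passed" field keeps its task-level meaning for backward compatibility.
    """
    safety_passed = policy_violations == 0
    context_passed = context_warnings == 0
    deployable = task_passed and safety_passed and context_passed
    return {
        "passed": task_passed,
        "task_passed": task_passed,
        "safety_passed": safety_passed,
        "context_passed": context_passed,
        "deployment_status": "deployable" if deployable else "blocked",
    }


# Hypothetical noisy_logs_repo run: correct answer, two context warnings.
noisy = gate_result(task_passed=True, policy_violations=0, context_warnings=2)
clean = gate_result(task_passed=True, policy_violations=0, context_warnings=0)
print(noisy["task_passed"], noisy["deployment_status"])
print(clean["task_passed"], clean["deployment_status"])
```

The first run shows the case the split is meant to expose: the task is correct, yet the run is not deployable.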

Extension Path

Split eval results into categories: schema, grounding, safety, context, and deployment. Keep the existing passed field for backward compatibility, but add category-level status so reports can explain why a task is correct but not deployable.

The first new gate should use existing evidence: context warnings from noisy_logs_repo and policy violations from the policy object. This avoids inventing a large eval framework while making the deployment decision more precise.

Worked Scenario: A Lucky Answer

An agent might output the correct diagnosis for buggy_calc without inspecting test_calculator.py. A human reading only the final answer might be satisfied. The eval should not be. The expected-file check exists because the process matters.

This is not pedantry. In a larger repo, a lucky answer may fail on the next case. Grounding checks are a way to prevent final-answer fluency from hiding weak evidence. They do not prove complete correctness, but they raise the floor.
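The lucky-answer scenario can be reproduced in a few lines by scoring the same run twice: once on final text only, once with the expected-file check. The answer text, the inspected-file list, and both scoring helpers are hypothetical; only the fixture's file name test_calculator.py and the keyword "division" come from the chapter.

```python
def final_text_score(answer, expected_keyword):
    """Text-only scoring: passes if the expected concept is mentioned."""
    return expected_keyword.lower() in answer.lower()


def process_aware_score(answer, inspected_files, expected_keyword, expected_file):
    """Score the answer and the evidence path as separate checks."""
    return {
        "expected_keyword_present": final_text_score(answer, expected_keyword),
        "expected_file_inspected": expected_file in inspected_files,
    }


# Hypothetical run: the diagnosis is right, but the test file was never opened.
answer = "The calculator mishandles division by zero."
inspected = ["calculator.py"]

text_only = final_text_score(answer, "division")
process = process_aware_score(answer, inspected, "division", "test_calculator.py")
lucky = text_only and not process["expected_file_inspected"]

print("text-only eval passes:", text_only)
print("process-aware checks:", process)
print("lucky answer detected:", lucky)
```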

Chapter Synthesis

Agent evals should measure contracts, not vibes. The local suite checks schema, expected evidence, keyword presence, hallucinated files, and invalid tool calls because those dimensions map to concrete failure modes. This makes failures actionable rather than merely disappointing.

The broader lesson is that evals are part of system design. They define what the team cares about, what regressions are unacceptable, and what evidence is required for rollout. A demo can impress; an eval can block.

Evidence and References

The eval behavior described in this chapter is evidenced in the repo by tests/test_evals.py and reports/sample_eval_report.md. NIST AI RMF frames measurement and risk management as part of AI governance (National Institute of Standards and Technology 2023).

Takeaways

Agent evals should test process and evidence, not only final text.
Small regression fixtures are valuable when each maps to a concrete failure mode.
A failed eval should point to the next subsystem to inspect.

Exercises

Add a failing eval and inspect the report. Confirm that the failure is visible in both the task-level results and the deployment recommendation.
Add a context-budget check. Decide whether exceeding the budget fails the task, blocks deployment, or requires human review.
Add an eval for a new fixture repo. Include expected files, expected keywords, hallucinated-file checks, and invalid-tool-call checks.
Decide which evals would block deployment. Separate correctness, grounding, safety, context, and artifact-generation gates.
Add an eval that catches a lucky answer: correct final issue, missing required evidence file.
Create a small incident-to-eval conversion template. Include incident summary, root cause, fixture construction, expected behavior, and regression owner.
Compare aggregate pass rate to per-risk-category pass rate. Explain which one should appear in a release gate.
Design an eval freshness policy so stale fixtures do not become misleading comfort.

Checklist

A successful demo is not an eval.
Evaluate process, not only final text.
Make pass/fail criteria explicit.
Expected evidence should be part of the eval.
Safety checks and correctness checks should be reported separately.
Deployment gates should treat missing artifacts as failures.
Real incidents should become regression fixtures.
Aggregate scores should not hide high-severity failures.

National Institute of Standards and Technology. 2023. Artificial Intelligence Risk Management Framework (AI RMF 1.0). https://doi.org/10.6028/NIST.AI.100-1.