3 Designing Tools

Learning Objective

After this chapter, you should be able to specify a tool contract that is deterministic, bounded, and safe by default.

Why This Matters

Tools are capabilities. A file reader can expose secrets. A shell tool can become code execution. A write tool can mutate durable state. The OWASP Top 10 for LLM Applications lists both prompt injection and excessive agency among its risks (OWASP Foundation 2025). This chapter narrows the claim to the repo: unsafe tools would make the lab unsafe even if the prompt says “be careful.”

Core Concept

A tool contract should specify:

input schema,
output schema,
allowed roots or resources,
mutability,
output caps,
error behavior,
trace fields.
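One way to make the list above concrete is to write the contract down as data. The following is a minimal sketch, not the lab's implementation: ToolContract, its field names, and the default cap of 4000 characters are all hypothetical, chosen only to mirror the seven items.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolContract:
    """Hypothetical record of one tool's contract; field names are illustrative."""

    name: str
    input_schema: dict            # argument name -> expected type name
    output_schema: dict           # result field name -> expected type name
    allowed_roots: tuple          # directories or resources the tool may touch
    mutating: bool = False        # read-only unless stated otherwise
    max_output_chars: int = 4000  # assumed default cap, not a value from the lab
    error_types: tuple = ("ToolPolicyError",)
    trace_fields: tuple = ("tool", "arguments", "policy_decision", "output_size")

    def validate_arguments(self, arguments: dict) -> list:
        """Return a list of problems; an empty list means the call shape is valid."""
        problems = []
        for key in arguments:
            if key not in self.input_schema:
                problems.append(f"unexpected argument: {key}")
        for key in self.input_schema:
            if key not in arguments:
                problems.append(f"missing argument: {key}")
        return problems


# Illustrative contract for a read_file-style tool.
READ_FILE = ToolContract(
    name="read_file",
    input_schema={"path": "str", "max_chars": "int"},
    output_schema={"text": "str", "truncated": "bool"},
    allowed_roots=("data/toy_repos",),
)
```

Writing the contract as a value rather than as prose means a test can assert on it, and a reviewer can diff it when the tool changes.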

The lab implements three read-only tools: list_files, read_file, and grep.

Capability Design

The safest tool is often the narrowest useful tool. A generic shell tool can do almost anything, which means the policy and approval burden is high. A narrowly scoped grep tool can search text but cannot mutate state, install packages, contact the network, or execute arbitrary commands. The difference is not implementation convenience; it is capability design.

Tool outputs also shape model behavior. If a tool returns a megabyte of logs, the next model call may be dominated by irrelevant context. If a tool returns unstable ordering, traces and tests churn. If a tool returns ambiguous errors, the runtime may retry incorrectly. The tool contract is therefore part of reliability, cost, and observability.

The repo’s tools are intentionally conservative:

paths must resolve under allowed roots,
returned file lists are sorted,
reads are capped,
shell execution is not a tool,
writes are not implemented in the core path.

That is not a statement that every production agent must be read-only. It is a statement that the first learning path should expose capability boundaries before mutation exists.

Error Contracts

Errors should be explicit enough to support policy and evals. ToolPolicyError is raised when a path or capability violates policy. In a larger system, the runtime might convert that exception into a trace event, a user-facing refusal, a human-approval request, or an eval failure.

The important design point is that the error is not just a Python exception. It is evidence that the runtime boundary worked.
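A runtime might turn that exception into evidence along the following lines. This is a sketch under assumptions: the ToolPolicyError class here is a stand-in for the lab's, and call_with_trace, fake_read_file, the event field names, and the example paths are invented for illustration.

```python
class ToolPolicyError(Exception):
    """Stand-in for the lab's exception; the real class may carry other fields."""


def call_with_trace(tool_name, tool, arguments, trace):
    """Run a tool and append one structured event to `trace` either way."""
    event = {"tool": tool_name, "arguments": dict(arguments)}
    try:
        result = tool(**arguments)
    except ToolPolicyError as exc:
        # A policy violation is recorded as evidence, not swallowed or retried.
        event.update(status="blocked", error_class=type(exc).__name__, detail=str(exc))
        trace.append(event)
        return None
    event.update(status="ok", output_size=len(str(result)))
    trace.append(event)
    return result


def fake_read_file(path):
    """Hypothetical tool with a deliberately naive parent-directory check."""
    if ".." in path.split("/"):
        raise ToolPolicyError(f"path escapes allowed root: {path}")
    return "contents of " + path


trace = []
call_with_trace("read_file", fake_read_file, {"path": "src/calc.py"}, trace)
call_with_trace("read_file", fake_read_file, {"path": "../secrets.txt"}, trace)
```

After both calls, the trace holds one "ok" event and one "blocked" event, so an eval can count boundary hits without parsing exception text.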

Tool Review Checklist

Before exposing a tool to an agent, write down its abuse case. For read_file, the abuse case is reading outside the allowed root or reading too much. For grep, the abuse case is returning huge output or searching an unintended tree. For a future write_file, the abuse case is destructive mutation. For a shell tool, the abuse case is arbitrary execution.

The tool contract should make those abuse cases testable. A test should fail when traversal succeeds, when output caps are missing, or when shell execution is enabled without explicit policy. The local tests are small because the tools are small; production tools need proportionally stronger tests.
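Some of these checks can be mechanical. The sketch below audits a hypothetical in-memory tool registry; the registry shape, the field names, the 4000-character cap, and audit_registry itself are assumptions for illustration rather than part of the lab's test suite.

```python
def audit_registry(registry):
    """Return findings, ordered by tool name, for tools that break baseline rules."""
    findings = []
    for name, spec in sorted(registry.items()):
        if not spec.get("max_output_chars"):
            findings.append(f"{name}: no output cap")
        if spec.get("executes_commands") and not spec.get("explicit_policy"):
            findings.append(f"{name}: command execution without explicit policy")
        if spec.get("mutating") and not spec.get("requires_approval"):
            findings.append(f"{name}: mutation without approval")
    return findings


# Hypothetical registry matching the lab's three read-only tools.
SAFE_REGISTRY = {
    "list_files": {"max_output_chars": 4000},
    "read_file": {"max_output_chars": 4000},
    "grep": {"max_output_chars": 4000},
}

# The same registry after someone adds a convenience shell tool.
RISKY_REGISTRY = dict(SAFE_REGISTRY, shell={"executes_commands": True})
```

A test that asserts audit_registry returns an empty list would start failing the moment the shell entry appears, which is the behavior the paragraph above asks for.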

Case Study Step

The repo triage system only needs read tools. That design choice removes whole classes of risk from the first version. There is no write_file, no delete_file, no shell, and no network call. That constraint may look limiting, but it clarifies the lesson: once read-only tools are safe and observable, mutation can be designed deliberately instead of accidentally.

Tool Output as an API

Tool output is an API to the runtime and, in model-backed systems, to the model. That means output shape matters. grep returns file name, line number, and line text rather than a formatted blob. list_files returns relative paths rather than absolute local paths. read_file returns capped text with a truncation marker. These choices make downstream behavior easier to test.

Production tools should be equally intentional. A database query tool should not return arbitrary raw tables if the agent only needs a few fields. A ticket-search tool should include source IDs and confidence metadata. A log-search tool should include time ranges and truncation status. A browser tool should separate visible text from hidden markup. The more structured the output, the easier it is to trace, evaluate, and summarize.

This does not mean every tool needs a heavy schema framework. It means the output should be designed, not improvised.

Mutating Tools

The core repo omits mutating tools, but a real agentic system often needs them. The moment a tool can mutate state, the design bar rises. You need authorization, dry-run behavior, idempotence, confirmation, rollback, and audit logs. You also need a way to distinguish proposal from execution.

A useful pattern is two-phase operation:

produce a proposed action,
require an approval or deterministic validator before execution.

For example, a repo agent might propose a patch but not apply it. A deployment agent might propose a rollback but require human approval. A support agent might draft a reply but not send it. The runtime can still be agentic while the dangerous side effect remains gated.
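The two phases can be sketched with a proposal object and a gate. Everything named here (ProposedAction, execute_if_approved, the approvals mapping, the ticket ID, and the reviewer address) is hypothetical; a real system would persist approvals rather than pass a dict.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedAction:
    """Phase one: a description of a side effect that has not happened yet."""

    kind: str
    target: str
    payload: str


def execute_if_approved(action, approvals, perform):
    """Phase two: run `perform` only when the action has a recorded approval."""
    approver = approvals.get(action)
    if approver is None:
        return {"status": "pending_approval", "action": action.kind}
    perform(action)
    return {"status": "executed", "action": action.kind, "approved_by": approver}


sent = []  # stands in for the real side effect, such as sending a reply
draft = ProposedAction(kind="send_reply", target="ticket-123", payload="Thanks, fixed.")

before = execute_if_approved(draft, approvals={}, perform=sent.append)
after = execute_if_approved(draft, approvals={draft: "reviewer@example.com"}, perform=sent.append)
```

The first call returns a pending status and leaves `sent` empty; only the second call, which carries an approval keyed to the exact proposal, performs the side effect.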

Tool Granularity

Tool granularity is a design decision. A very broad tool is flexible but hard to govern. A very narrow tool is easier to govern but may force the agent into too many calls. The right granularity depends on risk, latency, context, and audit needs.

For file systems, read_file and grep are narrow enough to test. For databases, a tool like run_sql may be too broad unless it is read-only, query-limited, and approval-gated. A safer alternative might expose specific analytical queries. For deployment systems, a generic API caller is too broad; a get_deployment_status tool and a separate request_rollback_approval tool are easier to review.

Good tool design makes the safe path easy and the risky path explicit.

Staff Practice Notes

Treat every tool proposal as a capability review. The most dangerous phrase is “just expose X so the model can figure it out.” Broad capability makes demos easier and operations harder. A staff engineer should push the design toward narrower tools, typed inputs, capped outputs, and explicit approval records before arguing about prompt quality.

Tool design is also product design. A tool that returns vague strings will produce vague traces, weak evals, and brittle reports. A tool that returns structured observations gives the whole system leverage. Invest in tool contracts early because every later layer depends on them.

Operational Invariants

Every tool call should be policy-mediated. A tool should not decide path authority, shell access, or mutation authority informally inside business logic. It should ask the policy layer first, record the decision, and fail with a structured error when the action is outside the boundary.

Every tool output should be bounded. Bounded output is a safety property and an inference property. It protects against accidental context floods, adversarial large files, and unstable report artifacts. A model that truly needs more data should request a narrower follow-up observation, not receive unbounded content by default.

Every tool should be deterministic unless nondeterminism is the feature. File listings should be sorted. Grep results should come back in a stable order, for example by file and then line number. Error shapes should be stable. Determinism makes tests useful and makes reports diffable. When a tool must call an external or nondeterministic service, the trace should record the parameters that explain variance.

The Lab

python -m agentic_systems_lab.tools

Reading the Lab Output

The command prints visible files under buggy_calc. The output is intentionally plain. Tool output should be boring when the task is boring. A file-listing tool should not infer importance, rewrite paths, or include local absolute paths unless the contract says so.

The stronger evidence lives in the tests. tests/test_tools.py verifies deterministic ordering, capped reads, structured grep matches, and path blocking. For a tool boundary, tests are more convincing than a pretty demo.

If the demo output surprises you, do not fix the display first. Inspect the tool contract. Are paths relative to the allowed root? Are directories sorted? Are long outputs capped with a visible marker? Are errors structured enough for an agent runtime to handle? A confusing tool demo usually means the interface is underspecified.

Code Walkthrough

The important boundary is not the filesystem call. It is policy resolution:

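# Policy decides whether the path is allowed; the tool only asks.
file_path = active_policy.resolve_path(path)
# Decode leniently so one undecodable byte does not abort the read.
text = file_path.read_text(encoding="utf-8", errors="replace")
# Enforce the output cap before anything reaches the caller.
return active_policy.cap_text(text, max_chars=max_chars)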

The tool does not decide whether a path is safe; ToolPolicy does. That separation makes it possible to test path traversal, read-only behavior, and output caps independently.
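For readers who want to see what such a policy object could look like, here is a minimal sketch. The method names resolve_path and cap_text and the ToolPolicyError name come from the chapter; the bodies, the constructor arguments, the default cap, and the "[truncated]" marker text are assumptions, and the lab's ToolPolicy may differ in detail.

```python
import tempfile
from pathlib import Path


class ToolPolicyError(Exception):
    """Raised when a request falls outside the policy boundary."""


class ToolPolicy:
    """Illustrative policy object; not the lab's actual class."""

    def __init__(self, allowed_root, max_chars=2000):
        self.allowed_root = Path(allowed_root).resolve()
        self.max_chars = max_chars

    def resolve_path(self, path):
        """Resolve `path` against the root and reject anything that escapes it."""
        candidate = (self.allowed_root / path).resolve()
        if not candidate.is_relative_to(self.allowed_root):  # Python 3.9+
            raise ToolPolicyError(f"path outside allowed root: {path}")
        return candidate

    def cap_text(self, text, max_chars=None):
        """Truncate long text and append a visible marker."""
        limit = self.max_chars if max_chars is None else max_chars
        if len(text) <= limit:
            return text
        return text[:limit] + "\n[truncated]"


# Use the system temp directory as a throwaway root for the demonstration.
policy = ToolPolicy(tempfile.gettempdir(), max_chars=5)
try:
    policy.resolve_path("../outside.txt")
    blocked = False
except ToolPolicyError:
    blocked = True
```

Because resolution happens before the comparison, both ../ segments and absolute paths are checked against the same root, and each of the two methods can be tested without touching a real file.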

list_files follows the same pattern: resolve the root under policy, walk only permitted paths, and return a sorted relative list. Sorting is a small but important choice. It makes test output deterministic, keeps reports stable, and prevents incidental filesystem ordering from becoming part of the lab behavior.
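A sketch of the sorted, relative listing follows. It leaves out the policy check to keep the focus on ordering, and it assumes that "visible" means no path component starts with a dot; the fixture file names are made up and the lab's list_files may be written differently.

```python
import tempfile
from pathlib import Path


def list_files(root):
    """Return sorted, root-relative POSIX paths for visible files under `root`."""
    root = Path(root).resolve()
    found = []
    for path in root.rglob("*"):
        relative = path.relative_to(root)
        # Skip anything inside or named as a hidden entry (assumed rule).
        if any(part.startswith(".") for part in relative.parts):
            continue
        if path.is_file():
            found.append(relative.as_posix())
    return sorted(found)


# Build a throwaway fixture so the example does not depend on the lab repo.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "tests").mkdir()
    (base / ".git").mkdir()
    (base / "tests" / "test_calc.py").write_text("", encoding="utf-8")
    (base / "calc.py").write_text("", encoding="utf-8")
    (base / ".git" / "config").write_text("", encoding="utf-8")
    listing = list_files(base)
```

Here `listing` is `["calc.py", "tests/test_calc.py"]` regardless of the order in which the filesystem yields entries, and no temporary-directory prefix leaks into the result.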

grep returns structured dictionaries instead of raw shell output. That shape matters because later evals and reports can reason about file, line, and text fields. A model may eventually read those results, but the runtime should not depend on prose parsing to understand its own tool output.
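The structured shape might look like the sketch below. The file, line, and text keys follow the chapter; the plain substring match, the max_matches parameter, and the fixture contents are assumptions made to keep the example small, and policy resolution is again omitted.

```python
import tempfile
from pathlib import Path


def grep(root, needle, max_matches=50):
    """Return up to `max_matches` matches as dicts with file, line, and text."""
    root = Path(root).resolve()
    matches = []
    # Sorting the paths keeps match order stable across filesystems.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                matches.append(
                    {"file": path.relative_to(root).as_posix(), "line": number, "text": line}
                )
                if len(matches) >= max_matches:
                    return matches  # cap reached; stop early
    return matches


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "calc.py").write_text("def add(a, b):\n    return a - b  # BUG\n", encoding="utf-8")
    (base / "notes.txt").write_text("BUG: add subtracts\n", encoding="utf-8")
    results = grep(base, "BUG")
    capped = grep(base, "BUG", max_matches=1)
```

An eval can now assert on `results[0]["line"]` directly, and the capped call shows that the bound is part of the function's contract rather than something applied later by the caller.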

Notice what is missing: no shell execution, no glob expansion delegated to a shell, no implicit current-working-directory authority, and no mutation helper. Those omissions are design decisions. The tool layer exposes only the capabilities needed for the lab, so policy remains understandable.

Expected Output

The command prints the visible files under data/toy_repos/buggy_calc. The ordering is deterministic so tests and traces do not churn.

The absence of absolute paths is part of the expected output. The reader should see repository-relative names, not local machine details. That keeps artifacts portable and avoids teaching downstream code to rely on environment-specific paths.

Failure Mode

If read_file accepted arbitrary paths, a malicious or confused agent could read outside the repository. If grep returned unbounded output, a noisy fixture could inflate context and cost.

The symptom is not always an obvious exploit. It may appear as a harmless convenience: let the agent read any file so debugging is easier, return full grep results so the model has more evidence, or expose shell because it is faster than designing narrow tools. Each convenience expands authority and makes behavior harder to reason about.

The root cause is treating tools as implementation details. When a model can call a tool, the tool becomes part of the runtime boundary. Path normalization, root confinement, output caps, structured errors, and trace fields are not polish. They are the mechanisms that turn raw filesystem access into a reviewable capability.

The artifact that exposes the failure is a policy or tool test that exercises the boundary. A traversal test proves path confinement. A max-output test proves context control. A blocked-shell test proves that broad execution is not accidentally available. If the only evidence is “the prompt told the model not to do that,” the boundary is not enforceable.

Production Translation

Treat every tool as a production capability. If a tool can send email, delete files, run shell commands, or call internal services, it needs an authorization model, trace evidence, approval gates, and rollback semantics.

For a staff review, require a tool inventory. For each tool, ask:

What resource can it observe or mutate?
Is the action idempotent?
What is the maximum output size?
What identity or tenant boundary applies?
What trace fields prove what happened?
What approval is needed for destructive actions?
What eval catches misuse?

This inventory should exist before an LLM is allowed to call the tool.

For rollout, treat tool exposure as a change-management event. A new read-only tool might require tests, trace fields, and output caps. A mutating tool should require dry-run mode, approval artifacts, rollback, and post-action verification. A broad tool such as shell, browser, database, or cloud API should be decomposed into narrower actions before it reaches production. The operating principle is least authority with evidence: give the runtime the smallest capability that can solve the task, then record enough evidence to prove how it was used.

Design Review Questions

For every proposed tool, ask:

What is the narrowest useful capability?
Is the tool read-only, mutating, or externally side-effectful?
What input validation applies?
What output cap applies?
What error contract does the runtime receive?
What trace fields are recorded?
What abuse case does the test suite cover?
What approval is required before execution?

If a tool cannot be described this way, it is not ready to expose to an agent.

Review Rubric

Reject tools that expose broad authority without a narrow contract: arbitrary shell, unrestricted filesystem reads, unbounded output, or mutation without approval.

Require review when a tool is read-only but returns large or ambiguous output. Read-only does not mean risk-free if the output can flood context or confuse downstream evals.

Accept a tool when policy checks happen before execution, outputs are bounded and structured, ordering is deterministic, and abuse-case tests cover the boundary.

Implementation Notes

Add tools in three steps. First, write policy tests for what the tool must not do. Second, write behavior tests for deterministic successful output. Third, add trace coverage so the runtime can reconstruct calls. Only after those steps should a model be allowed to select the tool.

For high-risk tools, add a dry-run mode before execution mode. The dry-run output should be structured enough for approval and evals.

Extension Path

The natural next tool is not write_file; it is propose_patch. The tool would accept a target path, the current content hash, and a proposed diff, then return a structured patch artifact without applying it. Tests can validate path confinement, diff shape, hash mismatch behavior, and report rendering.

After propose_patch is stable, add an approval record and only then consider an apply step. This sequence preserves the distinction between reasoning, proposing, approving, and mutating. Each boundary can have its own trace event and failure mode.
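A proposal step could be sketched as follows. This variant is a simplification: it takes the proposed full text and derives the diff itself, receives the current text as an argument instead of reading a file, and skips path confinement. The artifact field names, content_hash, and the choice of SHA-256 are assumptions.

```python
import difflib
import hashlib


def content_hash(text):
    """SHA-256 of the UTF-8 text; the hash choice is an assumption."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def propose_patch(target_path, current_text, expected_hash, proposed_text):
    """Return a patch artifact as a dict. Nothing is written anywhere."""
    actual_hash = content_hash(current_text)
    if actual_hash != expected_hash:
        # The file changed since the caller last read it; refuse to diff.
        return {"status": "hash_mismatch", "target_path": target_path,
                "expected_hash": expected_hash, "actual_hash": actual_hash}
    diff = "".join(difflib.unified_diff(
        current_text.splitlines(keepends=True),
        proposed_text.splitlines(keepends=True),
        fromfile=f"a/{target_path}",
        tofile=f"b/{target_path}",
    ))
    return {"status": "proposed", "target_path": target_path,
            "base_hash": actual_hash, "diff": diff, "applied": False}


old = "def add(a, b):\n    return a - b\n"
new = "def add(a, b):\n    return a + b\n"
artifact = propose_patch("calc.py", old, content_hash(old), new)
stale = propose_patch("calc.py", old, "0" * 64, new)
```

The artifact is plain data, so a report can render the diff and a test can check the hash-mismatch branch without any file ever being modified.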

Worked Scenario: Adding a Patch Tool

Assume the repo triage bot should eventually propose fixes. The tempting implementation is a write_file tool. The safer design starts with propose_patch. That tool returns a diff artifact but does not apply it. The report can include the diff, tests can validate the schema, and a human can approve before mutation.

Only after the proposal path is reliable should the system add an apply step. Even then, the apply step should require an exact patch, target path, expected current content hash, and approval record. This separates reasoning from mutation. It also gives the runtime a rollback story: the proposed patch and applied patch are artifacts, not hidden side effects.
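A gated apply step might look like this sketch. It uses an in-memory dict in place of a repository and takes the replacement text in place of an exact patch, both simplifications; ApplyRefused, apply_patch, the approval record shape, and the reviewer address are hypothetical.

```python
import hashlib


class ApplyRefused(Exception):
    """Hypothetical error for an apply request that fails a precondition."""


def apply_patch(files, target_path, expected_hash, new_text, approval):
    """Replace one entry in `files` after checking approval and content hash.

    Returns an audit record that keeps the previous text for rollback.
    """
    if not approval or not approval.get("approved_by"):
        raise ApplyRefused("missing approval record")
    current = files[target_path]
    actual_hash = hashlib.sha256(current.encode("utf-8")).hexdigest()
    if actual_hash != expected_hash:
        raise ApplyRefused("content changed since the patch was proposed")
    files[target_path] = new_text
    return {"target_path": target_path, "approved_by": approval["approved_by"],
            "previous_hash": actual_hash, "previous_text": current}


repo = {"calc.py": "return a - b\n"}
base_hash = hashlib.sha256(repo["calc.py"].encode("utf-8")).hexdigest()

# Without an approval record, the apply step refuses and changes nothing.
try:
    apply_patch(repo, "calc.py", base_hash, "return a + b\n", approval=None)
    refused = False
except ApplyRefused:
    refused = True

record = apply_patch(repo, "calc.py", base_hash, "return a + b\n",
                     approval={"approved_by": "reviewer@example.com"})
```

Keeping the previous text in the returned record is one simple way to give the runtime the rollback story described above.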

The lesson generalizes beyond files. For tickets, propose a reply before sending it. For infrastructure, propose a rollback before executing it. For databases, propose a migration before applying it.

Chapter Synthesis

Tool design is where agent architecture becomes concrete. A prompt can describe good behavior, but tools define what the runtime can actually observe or change. That makes tool interfaces security boundaries, reliability boundaries, and context boundaries at the same time.

The chapter’s reusable lesson is to design tools for audit before power. Narrow authority, deterministic output, structured errors, and policy mediation are not constraints on intelligence. They are the conditions that let intelligent behavior be safely evaluated.

Evidence and References

Security framing is grounded in OWASP’s LLM risk taxonomy (OWASP Foundation 2025). Tool behavior is grounded in tests/test_tools.py and tests/test_policy.py.

Takeaways

Tools are capability boundaries and should be reviewed as such.
Bounded, structured, deterministic output makes later evals and reports stronger.
Mutation should arrive through proposal, approval, execution, and verification stages.

Exercises

Add a blocked traversal test for paths containing ../ segments. Verify that the failure is a policy violation rather than an uncaught filesystem exception.
Add a max-output-size test for grep. Include enough fixture content to prove the cap is enforced deterministically.
Design a write_file schema but keep it approval-gated. Include path, expected current hash, patch or content, approval identity, and dry-run output.
Define the trace fields required for a tool call. Include tool name, normalized arguments, policy decision, output size, latency, success, and error class.
Design a narrow replacement for shell execution that solves one real debugging need without exposing arbitrary commands.
Write an abuse-case test for each boundary: traversal for file reads, excessive output for search, and unsupported mutation for the policy layer.
Extend the tool inventory with risk tier, owner, and rollback strategy. Explain what evidence would let you lower or raise the tier.
Review one production tool you have used and rewrite its interface as if an untrusted model could call it.

Checklist

Tools are capabilities, not helpers.
Read-only should be the default.
Output size is part of the safety contract.
Tool errors should be structured and testable.
Policy decisions should be traceable per call.
Broad tools need stronger approval than narrow tools.
Dry-run artifacts should precede mutating capabilities.
A model should never receive authority that the runtime cannot audit.

OWASP Foundation. 2025. OWASP Top 10 for Large Language Model Applications. https://owasp.org/www-project-top-10-for-large-language-model-applications.