9 Guardrails and Tool Policy

Learning Objective

After this chapter, you should be able to distinguish prompt-level guardrails from runtime policy and explain why risky capabilities need enforcement outside the model.

Why This Matters

Prompt instructions are not a capability boundary. If a tool can read arbitrary paths or execute shell commands, the system has that capability regardless of the prompt. OWASP documents prompt injection and excessive agency among LLM application risks (OWASP Foundation 2025). This lab narrows the lesson: enforce path and shell boundaries in code.

Core Concept

ToolPolicy defines:

allowed roots,
read-only mode,
shell disabled by default,
maximum file characters,
approval-required actions,
violation records.

Policy is checked before tools access files or shell-like capabilities.
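The field list above can be pictured as a small data object. The following is a minimal sketch, not the repo's actual class: the names read_only, allow_shell, allowed_roots, max_file_chars, and approval_required appear elsewhere in this chapter, while the violations field name, the default values, and the to_dict method are assumptions made for illustration.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class ToolPolicy:
    """Illustrative policy object; defaults are assumed, not the repo's."""

    allowed_roots: list[str] = field(default_factory=lambda: ["."])
    read_only: bool = True                 # writes blocked unless turned off
    allow_shell: bool = False              # shell disabled by default
    max_file_chars: int = 20_000           # assumed output cap
    approval_required: list[str] = field(
        default_factory=lambda: ["write_file", "delete_file", "run_shell"]
    )
    violations: list[dict] = field(default_factory=list)

    def to_dict(self) -> dict:
        """Return a plain dictionary suitable for traces and reports."""
        return asdict(self)


policy = ToolPolicy()
print(policy.to_dict())
```

The restrictive values are the defaults, so widening authority requires an explicit argument that shows up in code review.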

Prompt Guardrails vs Runtime Guardrails

Prompt guardrails can be useful. They can state intent, label tool output, and remind the model of constraints. But they are not sufficient for capability control. Runtime guardrails decide what the process can actually do.

In this repo, the core code path runs without a model, so the distinction is easy to see. A path outside the allowed root fails because resolve_path rejects it. Shell execution fails because check_shell rejects it. Those outcomes do not depend on model cooperation.

The design principle is simple: use prompts to guide behavior, but use code to enforce boundaries.

Policy as Audit Evidence

The policy is logged in the trace with policy_check, which makes the run reviewable. If a future run allowed shell execution, the trace would show a different policy. If a path traversal were attempted, the policy would record a violation.

Policy evidence is especially important when capabilities change over time. A production incident review needs to know not only what the agent did, but what it was allowed to do at the time.
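One way to picture policy as audit evidence is a trace entry that snapshots the active policy at the start of each run. This is a sketch under assumptions: the event shape, the run_id and recorded_at fields, and the policy_check_event helper are hypothetical and are not taken from the repo's trace format.

```python
import copy
import json
from datetime import datetime, timezone


def policy_check_event(run_id: str, policy: dict) -> dict:
    """Build a trace entry that freezes the policy as it was for this run."""
    return {
        "event": "policy_check",
        "run_id": run_id,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        # Deep copy so later changes to the policy cannot rewrite history.
        "policy": copy.deepcopy(policy),
    }


active_policy = {
    "allowed_roots": ["."],
    "read_only": True,
    "allow_shell": False,
    "max_file_chars": 20000,
}

trace = [policy_check_event("run-001", active_policy)]

# A later run with broader authority produces a visibly different snapshot.
active_policy["allow_shell"] = True
trace.append(policy_check_event("run-002", active_policy))

for event in trace:
    print(json.dumps(event, sort_keys=True))
```

Because each entry holds its own copy, a reviewer can compare what run-001 and run-002 were allowed to do without trusting the current configuration.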

Policy Drift

Policy drift happens when capabilities expand faster than review. A path allowlist grows. Output caps increase. A write tool is added for convenience. A shell tool is enabled for debugging and never removed. Each change may look reasonable alone, while the combined system becomes much riskier.

The mitigation is change review. Treat policy diffs like permission diffs. Require tests and sample traces for expanded authority.
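Treating policy diffs like permission diffs can be partly automated. The sketch below compares two serialized policies and lists every change that expands authority. The authority_expansions helper and the dictionary shape are assumptions for illustration; a real check would follow whatever the repo's serialized policy contains.

```python
def authority_expansions(old: dict, new: dict) -> list[str]:
    """Return human-readable findings for each way `new` is broader than `old`."""
    findings = []
    added_roots = set(new["allowed_roots"]) - set(old["allowed_roots"])
    if added_roots:
        findings.append(f"allowed_roots grew: {sorted(added_roots)}")
    if old["read_only"] and not new["read_only"]:
        findings.append("read_only disabled")
    if not old["allow_shell"] and new["allow_shell"]:
        findings.append("allow_shell enabled")
    if new["max_file_chars"] > old["max_file_chars"]:
        findings.append(
            f"max_file_chars raised: {old['max_file_chars']} -> {new['max_file_chars']}"
        )
    dropped = set(old["approval_required"]) - set(new["approval_required"])
    if dropped:
        findings.append(f"approval no longer required for: {sorted(dropped)}")
    return findings


baseline = {
    "allowed_roots": ["repo"],
    "read_only": True,
    "allow_shell": False,
    "max_file_chars": 20000,
    "approval_required": ["write_file", "run_shell"],
}
proposed = {
    "allowed_roots": ["repo", "/tmp"],
    "read_only": True,
    "allow_shell": True,
    "max_file_chars": 50000,
    "approval_required": ["write_file"],
}

for finding in authority_expansions(baseline, proposed):
    print("REVIEW:", finding)
```

A CI job could fail when the list is non-empty and the change lacks the tests and sample traces the review requires.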

Case Study Step

The prompt-injection fixture asks the system to read outside the repository. Because the policy boundary rejects paths outside the allowed root, that request has no effect on file access. The malicious text is still useful evidence; it is just not authority. This is the difference between recognizing an attack string and being controlled by it.

Approval Design

Approval should be tied to action class, not to a vague sense of risk. A read of a public fixture can be automatic. A read of a private repository may require user authorization. A write to a branch may require maintainer approval. A delete or shell command may require a stronger gate or may be prohibited entirely.
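The mapping from action class to gate can be written down as data. This is a hypothetical sketch: the action class names, the Gate enum, and the choice to prohibit delete and shell outright are illustrative assumptions, not the repo's schema.

```python
from enum import Enum


class Gate(Enum):
    AUTOMATIC = "automatic"
    USER_AUTHORIZATION = "user_authorization"
    MAINTAINER_APPROVAL = "maintainer_approval"
    PROHIBITED = "prohibited"


# Hypothetical action classes mapped to the gate each one must pass.
APPROVAL_GATES = {
    "read_public_fixture": Gate.AUTOMATIC,
    "read_private_repo": Gate.USER_AUTHORIZATION,
    "write_branch": Gate.MAINTAINER_APPROVAL,
    "delete": Gate.PROHIBITED,
    "run_shell": Gate.PROHIBITED,
}


def gate_for(action_class: str) -> Gate:
    """Look up the gate; unknown action classes get the strictest one."""
    return APPROVAL_GATES.get(action_class, Gate.PROHIBITED)


for name in ("read_public_fixture", "write_branch", "send_email"):
    print(name, "->", gate_for(name).value)
```

The lookup defaults to the strictest gate, so an action class nobody classified cannot slip through as automatic.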

The policy schema in this repo includes approval_required even though the core tools do not implement writes. That is a forward-looking design hook. It reminds the reader that the absence of mutation is a first-version constraint, not a permanent limit.

For a production system, approval records should include who approved, what was approved, what evidence was shown, and what exact action was executed. Otherwise approval becomes a ritual rather than an audit artifact.
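The four elements of an approval record can be captured in one immutable structure. The sketch below is an assumption about shape, not an existing schema: the ApprovalRecord name, its field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApprovalRecord:
    """Hypothetical audit record for one approved action."""

    approver: str                    # who approved
    approved_action: str             # what was approved, verbatim
    evidence_shown: tuple[str, ...]  # what the approver was shown
    executed_action: str             # what the runtime actually executed

    def action_matches(self) -> bool:
        """True only if the executed action is exactly what was approved."""
        return self.approved_action == self.executed_action


record = ApprovalRecord(
    approver="maintainer@example.com",
    approved_action="write_file docs/triage.md (diff 3f2a)",
    evidence_shown=("diff 3f2a", "tests: 42 passed"),
    executed_action="write_file docs/triage.md (diff 3f2a)",
)
print("approved and executed actions match:", record.action_matches())
```

Storing the approved and executed actions side by side lets a reviewer detect the case where one thing was approved and something else ran.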

Policy Tests as Security Regression Tests

Policy tests are not only correctness tests. They are security regression tests. If a future refactor accidentally allows path traversal, the test should fail. If a developer enables shell execution for convenience, the test should fail unless the policy and tests are deliberately updated.

The local tests cover a small surface: path confinement, read-only blocking, shell blocking, and output caps. That is enough for this repo’s capability set. A richer tool set should grow the policy tests before exposing new authority.
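A path-confinement regression test can be quite small. The sketch below does not reproduce the repo's tests; it uses a hypothetical stand-in for resolve_path, with an assumed signature, so that the tests are runnable on their own. It requires Python 3.9 or later for Path.is_relative_to.

```python
import tempfile
from pathlib import Path


class PolicyViolation(Exception):
    """Raised when a requested path escapes the allowed root."""


def resolve_path(root: Path, candidate: str) -> Path:
    """Stand-in confinement check; the signature is an assumption."""
    root = root.resolve()
    resolved = (root / candidate).resolve()  # collapses ".." and follows symlinks
    if not resolved.is_relative_to(root):
        raise PolicyViolation(f"{candidate!r} escapes {root}")
    return resolved


def test_traversal_is_rejected() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        for hostile in ("../outside.txt", "docs/../../outside.txt", "/etc/passwd"):
            try:
                resolve_path(Path(tmp), hostile)
            except PolicyViolation:
                continue
            raise AssertionError(f"boundary weakened: {hostile!r} was allowed")


def test_inside_path_is_allowed() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        resolved = resolve_path(Path(tmp), "docs/readme.md")
        assert resolved.is_relative_to(Path(tmp).resolve())


test_traversal_is_rejected()
test_inside_path_is_allowed()
print("path confinement tests passed")
```

The hostile inputs matter more than the allowed one: if a refactor replaces the resolved comparison with a string prefix check, the traversal cases are the ones that fail.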

Human Approval UX

Human approval is only useful if the human receives the right evidence. “Allow this action?” is weak. A useful approval prompt includes action type, target resource, proposed diff or command, evidence summary, policy reason, and rollback path.

For example, approving a file write should show the path, diff, test status, and why the agent wants the change. Approving shell should show the exact command, working directory, environment assumptions, and risk. If the approver cannot understand the action, the approval flow is cosmetic.
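An approval prompt can refuse to render when evidence is missing, which keeps "Allow this action?" from ever reaching the approver. This is a sketch with hypothetical field names and a hypothetical render_approval_prompt helper; it is not part of the repo.

```python
REQUIRED_FIELDS = (
    "action_type",
    "target_resource",
    "proposed_change",    # exact diff or exact command
    "evidence_summary",
    "policy_reason",
    "rollback_path",
)


def render_approval_prompt(request: dict) -> str:
    """Render the prompt, or raise if the approver would lack evidence."""
    missing = [name for name in REQUIRED_FIELDS if not request.get(name)]
    if missing:
        raise ValueError(f"approval request is missing: {', '.join(missing)}")
    lines = [f"{name}: {request[name]}" for name in REQUIRED_FIELDS]
    lines.append("Approve this exact action? [y/N]")
    return "\n".join(lines)


request = {
    "action_type": "write_file",
    "target_resource": "docs/triage.md",
    "proposed_change": "+ Added escalation contact",
    "evidence_summary": "tests passed; change requested in the issue",
    "policy_reason": "write_file is listed in approval_required",
    "rollback_path": "revert the commit",
}
print(render_approval_prompt(request))
```

Raising on an incomplete request turns a weak prompt into a visible failure instead of a rubber stamp.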

Staff Practice Notes

When reviewing guardrails, ask which controls still work if the model ignores them. Prompt guidance is useful, but it is not the same as a runtime boundary. A staff engineer should be comfortable saying that a prompt-only guardrail is not sufficient for a tool with real authority.

Policy should also be reviewed for expansion pressure. Teams often add broad authority while debugging and forget to remove it. Serializing policy into traces and reports makes that drift visible. Treat authority changes as code changes with tests, owners, and rollback.

Operational Invariants

Policy should fail closed. If the runtime cannot prove that a path is inside an allowed root, a command is permitted, or a write has approval, the action should not proceed. A blocked action with evidence is a healthy outcome; an ambiguous action that succeeds is not.
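Failing closed means that any failure to prove permission, including a bug in the check itself, blocks the action. The following is a minimal sketch; run_guarded, ActionDenied, and the denial record shape are hypothetical names introduced for illustration.

```python
from typing import Any, Callable


class ActionDenied(Exception):
    """Raised when permission for an action could not be established."""


def run_guarded(
    check: Callable[[], None],
    operation: Callable[[], Any],
    denials: list,
) -> Any:
    """Run `operation` only if `check` completes without raising anything."""
    try:
        check()
    except Exception as exc:
        # An unexpected error in the check is treated the same as a denial.
        denials.append({"reason": f"{type(exc).__name__}: {exc}"})
        raise ActionDenied(str(exc)) from exc
    return operation()


def broken_check() -> None:
    policy = {}              # misconfigured: the key is missing
    policy["allowed_roots"]  # raises KeyError, so nothing is proven


denials: list = []
executed: list = []
try:
    run_guarded(broken_check, lambda: executed.append("read"), denials)
except ActionDenied as exc:
    print("blocked:", exc)

print("operation ran:", bool(executed))
print("denials recorded:", len(denials))
```

The operation is only reached on the path where the check returned normally, so ambiguity cannot result in execution.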

Policy should be serializable. The active policy for a run should appear in traces and reports so reviewers can compare behavior across environments. Without serialization, “the policy was enabled” becomes a claim rather than evidence.

Policy should be tested at the boundary, not only through happy-path tools. Traversal attempts, blocked writes, blocked shell calls, and output caps are the tests that prove authority is constrained. A policy suite that only tests allowed behavior is incomplete.

The Lab

python -m agentic_systems_lab.policy

Reading the Lab Output

The policy JSON is a capability summary. read_only: true means write helpers should be blocked. allow_shell: false means shell execution should be blocked. allowed_roots defines the observation boundary. max_file_chars defines the default output cap.

If this output changes, the system’s authority changed. Treat that as a security-relevant diff.

The most important policy output is often the one you do not see in a happy path: violations. A production report should make denied actions visible because they show pressure against the boundary. A quiet run is not necessarily safer than a run with blocked attempts; it may simply have exercised less of the system.

Code Walkthrough

resolve_path permits only paths under allowed roots. check_write blocks mutation under read-only mode. check_shell blocks shell execution unless explicitly enabled.

The important implementation detail is that policy methods record violations before raising. This makes denied behavior visible to reports and tests. A blocked action is not just an exception; it is evidence that the runtime boundary was exercised.

resolve_path should be called before opening files, not after. check_write should be called before constructing side effects. check_shell should be called before executing anything. The order is part of the security property. A policy check that happens after partial execution is audit decoration, not enforcement.
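The record-then-raise pattern and the check-before-use ordering can be sketched together. The method names resolve_path, check_write, and check_shell come from this chapter, but their signatures, the violation record shape, the _deny helper, and the read_text tool below are assumptions rather than the repo's implementation. Path.is_relative_to requires Python 3.9 or later.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


class PolicyViolation(Exception):
    pass


@dataclass
class ToolPolicy:
    allowed_roots: list
    read_only: bool = True
    allow_shell: bool = False
    violations: list = field(default_factory=list)

    def _deny(self, action: str, detail: str) -> None:
        # Record first, then raise, so the denial survives as evidence.
        self.violations.append({"action": action, "detail": detail})
        raise PolicyViolation(f"{action} denied: {detail}")

    def resolve_path(self, candidate: str) -> Path:
        resolved = Path(candidate).resolve()
        for root in self.allowed_roots:
            if resolved.is_relative_to(Path(root).resolve()):
                return resolved
        self._deny("read_path", str(candidate))

    def check_write(self, target: str) -> None:
        if self.read_only:
            self._deny("write", target)

    def check_shell(self, command: str) -> None:
        if not self.allow_shell:
            self._deny("shell", command)


def read_text(policy: ToolPolicy, candidate: str, max_chars: int = 1000) -> str:
    path = policy.resolve_path(candidate)  # policy check happens before open()
    with open(path, encoding="utf-8") as handle:
        return handle.read(max_chars)


with tempfile.TemporaryDirectory() as tmp:
    policy = ToolPolicy(allowed_roots=[tmp])
    attempts = (
        lambda: policy.resolve_path(str(Path(tmp, "..", "secret.txt"))),
        lambda: policy.check_write("README.md"),
        lambda: policy.check_shell("ls"),
    )
    for attempt in attempts:
        try:
            attempt()
        except PolicyViolation as exc:
            print("blocked:", exc)

print(policy.violations)
```

Every denial leaves an entry in violations even though the caller only sees an exception, which is what makes blocked behavior visible to reports and tests.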

The policy object also serializes to a dictionary for traces and reports. That makes active authority inspectable. If a future run broadens roots or enables shell, the artifact diff should show it. Policy that cannot be serialized is harder to review.

Expected Output

The command prints JSON with the default allowed root, read_only: true, allow_shell: false, and approval-required actions.

This output is a capability declaration. If allow_shell changes to true or allowed_roots expands, the runtime’s authority changed even if no tool call has used that authority yet. Review policy output before reviewing model behavior.

Failure Mode

If path checks are only prompt instructions, malicious or confused model behavior can still request unsafe paths. If shell access is exposed without approval, the blast radius becomes much larger than a read-only file lab.

The symptom is misplaced trust. The prompt says “only read files in this repository,” but the runtime accepts a path outside the root. The prompt says “do not run shell,” but the tool exists and executes commands. The model may behave well during demos, yet the system has already granted authority it cannot justify.

The root cause is confusing behavioral guidance with enforcement. Guardrail language can improve model behavior, but it does not replace runtime policy. A policy must run before the tool operation, record its decision, and fail closed when authority is unclear. The strongest prompt cannot make an unsafe tool safe if the runtime still executes it.

The artifact that exposes the failure is a violation record. A blocked traversal or blocked shell attempt should be visible in trace and report data. That evidence proves the boundary exists and gives reviewers a chance to understand attempted behavior. Silent denial is better than unsafe execution, but recorded denial is better still because operators can review it.

Production Translation

Production guardrails should include policy enforcement, not just model instructions. Risky tools need explicit approval semantics, audit logs, and narrower capability scopes.

Approval is not binary. A low-risk read may be automatic. A write may require user confirmation. A delete may require maintainer approval. A network call may require tenant-aware egress policy. The local policy schema is small, but it is designed so those distinctions can be added rather than bolted on later.

Policy also needs rollout stages. Local development may allow broader inspection than CI. CI may allow fixture roots but no user data. Shadow mode may allow production reads but no writes. Production mutation may require approvals and rate limits. Each stage should serialize its active policy into traces and reports. Otherwise the team cannot tell whether a safe test run and a risky production run actually used the same boundary.

Design Review Questions

For guardrails and policy, ask:

Which controls are prompt guidance?
Which controls are runtime enforcement?
What capability does each tool expose?
What policy blocks unintended authority?
What violation is logged?
What approval flow exists for risky actions?
What policy changes require security review?
What test fails if the boundary weakens?

The safest policy is one whose failure mode is visible in tests, traces, and reports.

Review Rubric

Reject policy designs that rely on prompt wording as the enforcement boundary. The runtime must constrain tools even when model behavior is wrong.

Require review when policy exists but is not serialized, tested at abuse boundaries, or represented in reports. Invisible policy is hard to audit.

Accept the policy when it fails closed, records violations, distinguishes read/write/shell authority, supports approval semantics for risky actions, and is visible in trace and report artifacts.

Implementation Notes

A production implementation should load policy from a versioned file rather than hard-coding defaults. That file should be reviewed like source code. Changes to allow_shell, write permissions, network access, or allowed roots should be obvious in diff review.
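Loading policy from a versioned file could look roughly like the sketch below. The JSON layout, the load_policy helper, and the strict key validation are assumptions; the key names follow the fields this chapter mentions. The text is parsed from a string here so the example is self-contained, whereas a real loader would read the reviewed file from the repository.

```python
import json

REQUIRED_KEYS = frozenset(
    {"allowed_roots", "read_only", "allow_shell", "max_file_chars", "approval_required"}
)


def load_policy(text: str) -> dict:
    """Parse policy JSON and reject files with missing or unknown keys."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("policy file must contain a JSON object")
    missing = REQUIRED_KEYS - set(data)
    unknown = set(data) - REQUIRED_KEYS
    if missing or unknown:
        raise ValueError(
            f"invalid policy: missing={sorted(missing)} unknown={sorted(unknown)}"
        )
    return data


POLICY_JSON = """
{
  "allowed_roots": ["."],
  "read_only": true,
  "allow_shell": false,
  "max_file_chars": 20000,
  "approval_required": ["write_file", "run_shell"]
}
"""

policy = load_policy(POLICY_JSON)
print(json.dumps(policy, indent=2, sort_keys=True))
```

Rejecting missing keys instead of filling in defaults keeps every authority setting explicit in the file, so a diff of the file is a diff of the authority.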

The local object model is still useful: it defines the fields a file-based policy would need.

Extension Path

Add environment-specific policy profiles: local, CI, shadow, and production. Each profile should serialize allowed roots, read-only status, shell status, output caps, and approval requirements. Tests should prove that risky capabilities do not leak from local development into CI or production.

The extension is valuable even without adding new tools. It teaches that policy is not one global constant. Authority changes by environment, and those changes must be visible in artifacts.
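A starting point for the profiles might look like the sketch below. Every value in PROFILES is a hypothetical example chosen for illustration, including the choice to allow shell only in local development; the policy_for helper and the tests are likewise assumptions.

```python
import copy

# Hypothetical profiles; real roots, caps, and approvals would differ.
PROFILES = {
    "local": {
        "allowed_roots": [".", "fixtures"], "read_only": True,
        "allow_shell": True, "max_file_chars": 50000, "approval_required": [],
    },
    "ci": {
        "allowed_roots": ["fixtures"], "read_only": True,
        "allow_shell": False, "max_file_chars": 20000, "approval_required": [],
    },
    "shadow": {
        "allowed_roots": ["/srv/repos"], "read_only": True,
        "allow_shell": False, "max_file_chars": 20000, "approval_required": [],
    },
    "production": {
        "allowed_roots": ["/srv/repos"], "read_only": False,
        "allow_shell": False, "max_file_chars": 20000,
        "approval_required": ["write_file", "delete_file"],
    },
}


def policy_for(environment: str) -> dict:
    """Return a copy of the named profile; unknown environments are an error."""
    if environment not in PROFILES:
        raise ValueError(f"no policy profile for environment {environment!r}")
    return copy.deepcopy(PROFILES[environment])


def test_shell_does_not_leak_beyond_local() -> None:
    for environment in ("ci", "shadow", "production"):
        assert policy_for(environment)["allow_shell"] is False


def test_writable_profiles_require_approval() -> None:
    for environment in PROFILES:
        policy = policy_for(environment)
        if not policy["read_only"]:
            assert "write_file" in policy["approval_required"]


test_shell_does_not_leak_beyond_local()
test_writable_profiles_require_approval()
print("profile tests passed")
```

There is no fallback profile: an unrecognized environment name raises instead of silently receiving the local settings.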

Worked Scenario: Debug Shell Temptation

A developer debugging the triage bot may want to add a shell tool. It would make exploration easy: run tests, list directories, inspect the environment. It would also change the system’s authority dramatically.

The policy review should force that change into the open. Is shell needed in production, or only during development? Can a narrower tool solve the problem? Should shell be read-only? What commands are allowed? Is approval required? What gets logged? Without answers, shell access should stay disabled.

The point is not that shell is always forbidden. The point is that shell is too broad to appear accidentally.

Chapter Synthesis

Policy is the runtime answer to a security question: what can the system actually do? Prompt instructions may guide model behavior, but policy constrains tool authority. This chapter makes that distinction executable through path roots, read-only mode, shell blocking, output caps, and violation records.

The chapter also reframes blocked actions as evidence. A violation is not just an error; it is proof that the boundary was exercised. Production systems should preserve that proof because it tells reviewers where pressure against the boundary appeared.

Evidence and References

Security risk categories are grounded in OWASP (OWASP Foundation 2025). Local enforcement is tested in tests/test_policy.py.

Takeaways

Prompt guidance is not runtime enforcement.
Policy should fail closed, serialize active authority, and record violations.
Approval is an artifact for a specific risky action, not a general feeling of permission.

Exercises

Try to read outside the allowed root. Verify the result is recorded as a policy violation with enough metadata for review.
Add a policy violation report section. Group violations by tool, path, decision, and run ID.
Design an approval flow for write_file. Include dry run, exact diff, approver identity, expiry, and post-apply verification.
Decide which tools should never be exposed. Explain whether the reason is authority, ambiguity, irreversibility, or auditability.
Write a policy test for a tool that is read-only in one environment and approval-gated in another.
Design a policy record schema that can distinguish denied, allowed, allowed-with-warning, and requires-approval decisions.
Review a broad capability such as shell, browser, database, ticket update, or cloud API. Replace it with the narrowest capability that would satisfy one use case.
Define the operational owner for policy changes and the evidence required before expanding a tool allowlist.

Checklist

Prompt text is not a security boundary.
Runtime policy should enforce capability limits.
Violations should be logged as evidence.
Policy should be evaluated before tool execution.
Approval should be an artifact, not a chat message.
Environment-specific policy must be explicit.
Tool authority should expand only with tests and review.
Denied actions are useful evidence and should not disappear.

OWASP Foundation. 2025. OWASP Top 10 for Large Language Model Applications. https://owasp.org/www-project-top-10-for-large-language-model-applications.