2  From LLM Calls to Agentic Systems

Learning Objective

After this chapter, you should be able to classify a system as a model call, deterministic workflow, tool-using system, agent, or multi-agent system, and identify which engineering boundaries make it production-shaped.

Why This Matters

A single model call can be useful, but it does not tell you what the system is allowed to do, what evidence it used, what state it changed, how success is measured, or how a bad run is rolled back. The OpenAI Agents SDK documentation describes agentic applications in terms of models using tools and context, handoffs, streaming, and traces (OpenAI 2025). Anthropic’s engineering guidance distinguishes workflows with predefined code paths from agents that direct their own processes and tool use (Anthropic 2024). Those definitions are useful, but this book uses a stricter systems lens: autonomy is only one dimension.

Core Concept

A model call is an inference request. A workflow is fixed orchestration. A tool-using system gives the model or runtime access to capabilities. An agent chooses some part of its path at runtime. A production-shaped agentic system adds boundaries and evidence around that behavior.
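
The five categories can be written down as a small classifier. This is a sketch for review conversations, not a vendor definition and not code from the lab repo: the field names and the tie-break order (most demanding category first) are assumptions of this example.

```python
from dataclasses import dataclass
from enum import Enum


class SystemKind(Enum):
    MODEL_CALL = "model call"
    WORKFLOW = "deterministic workflow"
    TOOL_USING = "tool-using system"
    AGENT = "agent"
    MULTI_AGENT = "multi-agent system"


@dataclass(frozen=True)
class SystemProfile:
    # Hypothetical review inputs; none of these names come from the repo.
    fixed_orchestration: bool      # code, not the model, decides the step order
    model_can_call_tools: bool     # the model is offered capabilities to invoke
    chooses_path_at_runtime: bool  # the model selects some part of its own path
    coordinating_runtimes: int     # runtimes or roles that hand off work


def classify(profile: SystemProfile) -> SystemKind:
    # Check the most demanding category first, then fall back.
    if profile.coordinating_runtimes > 1:
        return SystemKind.MULTI_AGENT
    if profile.chooses_path_at_runtime:
        return SystemKind.AGENT
    if profile.model_can_call_tools:
        return SystemKind.TOOL_USING
    if profile.fixed_orchestration:
        return SystemKind.WORKFLOW
    return SystemKind.MODEL_CALL
```

Note that none of the five labels says anything about being production-shaped. That property comes from the boundaries and evidence around the behavior, which this classifier deliberately does not inspect.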

The core arc of this repo is:

build -> tool access -> state/context -> trace -> eval -> policy -> report -> deployment gate

This is the author’s interpretation, not a vendor definition: the model is not the system. The system is the model plus runtime, tools, state, traces, evals, policies, and rollout controls.

A More Precise Vocabulary

The distinction matters because teams often use “agent” to describe very different things. A single prompt submitted through a chat interface is a model interaction. A batch job that sends fixed prompts over a dataset is a workflow with model calls. A service that lets a model choose from tools is an agent-shaped runtime, but it is not automatically production-shaped. A multi-agent system introduces multiple runtimes or roles that coordinate, hand off, debate, route, or specialize. Each step adds a new failure surface.

This book treats autonomy as a design variable, not as a maturity level. More autonomy can be useful when the task path cannot be known in advance, but it also means the runtime must record more evidence. If a system can decide which file to inspect, then the trace should show the selected file. If it can call tools, the policy should define what tools are allowed. If it can accumulate context, the context profiler should show whether that context is bounded. If it can make a recommendation, the eval should say what would make the recommendation acceptable.

The local lab intentionally begins with a toy repository. That keeps the observation surface small enough for inspection. The lesson is not that real systems are toy-sized. The lesson is that a production-shaped system can be studied only when the boundaries are visible.

Boundary Map

Think of an agentic system as a set of nested contracts:

user/task contract
runtime contract
tool contract
state contract
trace contract
eval contract
deployment contract

The user/task contract defines what the system is trying to do. The runtime contract defines how execution proceeds. The tool contract defines what capabilities exist. The state contract defines what is authoritative and what is merely context. The trace contract defines what evidence is saved. The eval contract defines what success means. The deployment contract defines what is allowed to ship.

When a system fails, the useful question is not immediately “Did the model fail?” The useful question is: which contract failed? If the agent read the wrong file, the failure may be file selection. If it found the right evidence but returned malformed JSON, the failure may be output validation. If it attempted to read outside the repo, the policy should catch it. If the answer looks right but no trace exists, the failure is observability.
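
The “which contract failed?” question can be asked mechanically before anyone debates model quality. The sketch below assumes a hypothetical run record with a `trace` list of typed events, an `expected_files` list, and a `raw_output` string. None of these names are taken from the lab repo, and the labels only indicate where to start looking.

```python
import json
from typing import Optional


def first_failed_contract(run: dict) -> Optional[str]:
    """Return a label for the first contract to investigate, or None."""
    trace = run.get("trace")
    if not trace:
        # Without a trace, nothing else about the run can be verified.
        return "observability"
    if any(event.get("type") == "policy_violation" for event in trace):
        # The run attempted something the policy flagged.
        return "policy"
    read = {e.get("path") for e in trace if e.get("type") == "read_file"}
    if not set(run.get("expected_files", [])) <= read:
        # The evidence that should have been inspected never was.
        return "file selection"
    try:
        json.loads(run.get("raw_output") or "")
    except json.JSONDecodeError:
        # Right evidence, but the answer is not machine-readable.
        return "output validation"
    return None
```

The order of the checks is itself a design choice: observability comes first because every later check depends on the trace being present.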

Review Heuristics

When you review an agentic-system proposal, ask the proposer to draw the system without the model first. The diagram should still contain task intake, tools, state, context assembly, traces, evals, policy, reports, and deployment gates. If the diagram collapses into “call the model,” the system is not yet designed.

Then ask which parts are deterministic. Most production systems have more deterministic structure than their demos suggest. Authentication, data fetching, policy, validation, trace writing, report generation, and rollout checks should not depend on model creativity. The model can be central to interpretation, summarization, ranking, or planning, but the runtime should still own the contracts around it.

Running Case Study

The book uses one small case study repeatedly: a repo triage system. In a real company, this might inspect pull requests, failing tests, logs, or issue reports. In this lab, it inspects toy repositories. The simplification is intentional. The point is to show that even a tiny agent-shaped system already needs boundaries.

buggy_calc gives the system a benign diagnosis task. prompt_injection_repo gives it untrusted content. noisy_logs_repo gives it context pressure. Together they form a compact failure matrix: task correctness, safety, and infrastructure cost.

What Counts as Production-Shaped?

This book uses “production-shaped” carefully. It does not mean the repository is production-ready. It means the local system already has the same categories of artifacts that a production system would need: policy, traces, evals, context accounting, and reports. The scale is small, but the shape is deliberate.

A production-shaped learning repo should make future hardening possible without rewriting the whole architecture. If you later add an LLM-backed decision step, the tool policy should still apply. If you add a hosted model, the context profiler should still expose prompt growth. If you add a write tool, the report should still surface approval requirements. If you add a richer eval suite, the output schema should still make deterministic checks possible.

That is the reason this book resists vague chatbot examples. A chatbot transcript can teach interaction design, but it often hides runtime boundaries. The repo triage case study keeps the runtime visible. You can inspect the tools. You can read the policy. You can open the trace. You can see the eval report. You can render the book. Those artifacts create a stronger learning loop than a fluent demo answer.

Reader Contract

The reader is expected to bring engineering judgment. The book will not explain what JSON is, how Python functions work, or why tests matter. Instead it asks sharper questions: what would falsify the claim that the agent inspected evidence? What artifact proves the policy was active? Which output fields make evals possible? Which context segments are dynamic? Which deployment gate would block a risky run?

The payoff is speed. A technical reader should be able to map each chapter to a concrete artifact and then adapt the pattern to a real system.

Common Maturity Trap

A team can appear to advance by adding a stronger model while leaving the system boundary unchanged. That may improve some answers, but it does not add observability, evals, or policy. In the vocabulary of this book, that is model improvement, not system maturity.

System maturity shows up when failures become easier to analyze. If a run is bad, can the team say which files were inspected? Can it say whether the model saw untrusted instructions? Can it identify the first cache-breaking segment? Can it reproduce the eval result? Can it explain why a deployment gate blocked? Those questions define the difference between a powerful demo and a supportable system.

This is why the book treats “agentic” as an engineering burden rather than a badge. Dynamic behavior can be valuable, but it makes evidence more important, not less.

Staff Practice Notes

In staff-level review, insist on vocabulary that predicts engineering work. “Agent” should imply runtime decisions, tools, state, evidence, and gates. If the term only means “LLM call with a longer prompt,” the design review should say so. Precise vocabulary protects roadmaps because it stops teams from smuggling infrastructure work into a feature estimate.

Also separate product value from architectural novelty. A model call may be enough for a low-risk drafting feature. A workflow may be enough for a known compliance check. An agent may be justified for runtime evidence selection. Multi-agent systems may be justified for separable roles or independent review. The right abstraction is the smallest one that makes the product behavior measurable and supportable.

Operational Invariants

The first invariant is artifact continuity. A useful agentic system should leave behind enough structured evidence for another engineer to reconstruct the run without relying on the operator’s memory. That evidence does not have to be elaborate at the beginning, but it must be deliberate: input, policy, tool calls, context observations, output schema, eval result, and report status.

The second invariant is authority separation. The component that produces language should not be the component that silently grants capabilities. Tools, policies, approvals, and deployment gates live outside the model. This separation is what lets the system remain reviewable when the model output is fluent, incomplete, or wrong.
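
A minimal sketch of authority separation, assuming a hypothetical `ToolRuntime` class that is not part of the lab repo: the model may request a tool by name, but the allowlist lives in the runtime and every refusal is counted.

```python
from typing import Callable, Dict, FrozenSet


class PolicyViolation(Exception):
    """Raised when a requested tool is outside the granted set."""


class ToolRuntime:
    def __init__(self, tools: Dict[str, Callable[..., object]],
                 allowed: FrozenSet[str]) -> None:
        self._tools = tools
        self._allowed = allowed
        self.violations = 0  # surfaced later in a report, not hidden

    def call(self, name: str, **kwargs: object) -> object:
        # The request may originate in model output; the grant is decided here.
        if name not in self._allowed or name not in self._tools:
            self.violations += 1
            raise PolicyViolation(f"tool not permitted: {name}")
        return self._tools[name](**kwargs)
```

Registering a tool and allowing a tool are separate steps in this sketch. A capability can exist in the codebase without being granted to a particular run.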

The third invariant is falsifiability. Every important claim about the system should have a way to fail. If the claim is “the system inspected the relevant files,” an eval should fail when it does not. If the claim is “the system stayed inside policy,” a violation count or trace event should expose the contrary. A system that cannot be falsified cannot be operated rigorously.
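
Both claims in the paragraph above can be turned into checks that are able to fail. The sketch assumes trace events are dictionaries with a `type` field and, for reads, a `path` field. Those shapes are assumptions of this example, not the lab’s trace format.

```python
from typing import Dict, Iterable, List


def eval_inspected_files(trace: List[Dict[str, object]],
                         required: Iterable[str]) -> Dict[str, object]:
    """Fail when any required file never appears in a read_file event."""
    inspected = {e.get("path") for e in trace if e.get("type") == "read_file"}
    missing = sorted(set(required) - inspected)
    return {"name": "inspected_relevant_files",
            "passed": not missing,
            "missing": missing}


def eval_stayed_in_policy(trace: List[Dict[str, object]]) -> Dict[str, object]:
    """Fail when the trace records at least one policy violation."""
    count = sum(1 for e in trace if e.get("type") == "policy_violation")
    return {"name": "stayed_inside_policy",
            "passed": count == 0,
            "violations": count}
```

Neither check reads the final answer. A fluent but ungrounded answer still fails the first one.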

The Lab

Run the deterministic baseline:

python scripts/run_all_examples.py --example workflow_baseline

Reading the Lab Output

Do not read the baseline output as a user-facing answer. Read it as an evidence inventory. The files array says which repository files were visible. The division_mentions array says where the workflow found relevant text. That is enough to establish what the runtime could observe before any diagnosis step exists.

This is the first habit of the book: separate observation from interpretation. Observation says “these lines exist.” Interpretation says “there is a contract mismatch.” The deterministic baseline intentionally stops before interpretation so the later agent result can be compared against it.

When reviewing your own output, look for three things: scope, order, and absence. Scope tells you what the runtime could see. Order tells you whether output is deterministic enough to diff. Absence tells you what the baseline did not do: no diagnosis, no mutation, no hidden memory, no model call. Those absences are useful because they make later additions visible.
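
The scope, order, and absence review can be sketched as a comparison of two baseline runs. The `files` key is the one described above. The set of forbidden keys is hypothetical and stands in for “things a baseline should not have produced.”

```python
import json
from typing import Dict

# Hypothetical markers of interpretation, mutation, memory, or model use.
FORBIDDEN_KEYS = {"diagnosis", "writes", "memory", "model"}


def review_baseline(first: Dict[str, object],
                    second: Dict[str, object]) -> Dict[str, object]:
    def canonical(payload: Dict[str, object]) -> str:
        # sort_keys normalizes object key order but preserves list order,
        # so a reordered files array still counts as a difference.
        return json.dumps(payload, sort_keys=True)

    return {
        "scope": list(first.get("files", [])),
        "order_stable": canonical(first) == canonical(second),
        "unexpected": sorted(FORBIDDEN_KEYS & set(first)),
    }
```

If `order_stable` is false for two runs over the same fixture, fix the ordering before comparing any agent result against the baseline.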

Code Walkthrough

The baseline uses list_files and grep against data/toy_repos/buggy_calc. There is no LLM and no autonomy. That is the point: the first artifact is a control path that can be inspected and tested.

Read the baseline as a systems-design object, not as a clever algorithm. It establishes the minimum useful evidence path: confine the repository root with policy, list deterministic files, search for relevant terms, and return structured observations. The result is not powerful, but it is stable. Stability is what makes later agent behavior measurable.

The important engineering move is that the same tool layer can serve both workflow and agent paths. list_files and grep already have path confinement and deterministic ordering, so this chapter does not need a model to teach a production lesson. It teaches that tool behavior can be specified before model behavior exists.

When this baseline becomes a control group, regressions become easier to interpret. If the baseline fails, inspect tools or fixtures. If the baseline passes and the agent fails, inspect the dynamic layer. That separation is one reason the repo starts with a deliberately small example.

Expected Output

The output should include README.md, calculator.py, and test_calculator.py, plus matches for divide or division. This is repo-observed behavior, not a claim about agents in general.

This output proves observability, not diagnosis. It shows that the runtime can see the relevant files and lines under policy. It does not prove that a model can reason over them, that the finding is complete, or that deployment is safe. Keeping that proof boundary narrow is the discipline the rest of the book builds on.

Failure Mode

A transcript can look successful while the system is not debuggable, measurable, or safe. Without traces, evals, and policy, a team cannot reconstruct which evidence was used or which tool boundary mattered.

The common symptom is a demo that produces a strong final paragraph but leaves no durable trail. The team can quote the answer, but it cannot answer basic review questions: which files were inspected, which tool calls failed, whether the output schema was validated, whether the model relied on stale memory, or whether the run crossed an authority boundary. The demo is persuasive to watch and weak to operate.

The root cause is usually category error. The team evaluates a model call as if it were a system. A model response is one component; an agentic system also needs runtime control, tool confinement, context assembly, trace emission, evals, and rollout gates. If those pieces are absent, the system’s quality cannot be inferred from a single successful answer.

The artifact that exposes the failure is absence itself. No trace means no process evidence. No eval means no regression claim. No policy record means no capability boundary. No report means no release decision. A staff-level review should treat those missing artifacts as architectural facts, not as documentation tasks.

Production Translation

Before deployment, a team needs explicit answers to five questions:

  • What is the system allowed to observe or mutate?
  • What did it actually do?
  • Which artifacts measure success?
  • Which failure modes are bounded?
  • What rollback or review path exists?

For a staff-level review, those questions should become artifacts. “The agent is safe” is too broad. “The read-only file tools are confined to data/toy_repos, path traversal is tested, shell is disabled, and policy violations are reported” is reviewable. “The agent works” is weak. “The buggy_calc, prompt_injection_repo, and noisy_logs_repo evals pass, while the report still requires human review due to large dynamic context” is stronger.

The book repeatedly uses this pattern: make the claim narrower, tie it to an artifact, and identify what would falsify it.

In a production review, this becomes a contract table. Each system claim should have an owner, artifact, gate, and rollback. “Reads repository files” maps to tool tests and path-policy traces. “Diagnoses division-by-zero mismatches” maps to fixture evals. “Handles prompt injection safely” maps to the injection fixture and violation count. “Is ready for automatic PR comments” maps to deployment status, approval policy, and monitoring. The table does not make the agent good by itself, but it prevents the review from becoming a loose argument about model capability.
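
One way to keep such a contract table honest is to store it as data and check it for blanks. The sketch below is illustrative: the `ContractRow` type and the example owner, gate, and rollback values are hypothetical, while the claim and artifact pairing follows the paragraph above.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class ContractRow:
    claim: str
    owner: str
    artifact: str
    gate: str
    rollback: str


def incomplete_claims(table: List[ContractRow]) -> List[str]:
    """Return every claim whose row leaves any column blank."""
    return [row.claim for row in table
            if any(not getattr(row, f.name).strip() for f in fields(row))]


table = [
    ContractRow(
        claim="Reads repository files",
        owner="tools team",                      # hypothetical owner
        artifact="tool tests and path-policy traces",
        gate="tool test suite passes",           # hypothetical gate
        rollback="disable the file tools",       # hypothetical rollback
    ),
    ContractRow(
        claim="Is ready for automatic PR comments",
        owner="",                                # nobody has signed up yet
        artifact="deployment status, approval policy, and monitoring",
        gate="",
        rollback="",
    ),
]
```

Here `incomplete_claims(table)` returns only the second claim, which is the point: the review conversation moves from “is the agent good?” to “who owns this row?”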

Design Review Questions

Use these questions before approving an agentic-system design:

  • Which decisions are deterministic, and which decisions are delegated to a model?
  • What capabilities can the runtime access?
  • What state is durable, and what state is prompt-only?
  • What trace evidence proves the path taken?
  • What evals would fail if the model gave a plausible but unsupported answer?
  • What policy prevents the system from exceeding its intended authority?
  • What deployment gate blocks rollout when evidence is missing?

If the design cannot answer these questions, the missing answers are not documentation gaps. They are system-design gaps.

Review Rubric

Reject a design that equates “uses an LLM” with “is an agentic system,” or that presents final-answer quality as the only evidence. That design has not specified the runtime.

Require review when the design has tools or traces but no evals, or evals but no policy. Partial evidence is useful, but the missing artifact should limit rollout scope.

Accept the first version only when the autonomy boundary, tool authority, trace path, eval gate, and rollback or review path are explicit. The model can be simple; the system contract should not be vague.
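
The rubric can be approximated as a three-way decision over which surfaces a design makes explicit. This is a deliberate simplification, and the surface names are labels chosen for this sketch: a design with none of the five is rejected, a design with all five is accepted, and anything in between requires review.

```python
from typing import List, Set, Tuple

REQUIRED_SURFACES: Tuple[str, ...] = (
    "autonomy_boundary",
    "tool_authority",
    "trace_path",
    "eval_gate",
    "rollback_or_review_path",
)


def missing_surfaces(explicit: Set[str]) -> List[str]:
    """List required surfaces the design has not made explicit."""
    return [name for name in REQUIRED_SURFACES if name not in explicit]


def review_decision(explicit: Set[str]) -> str:
    missing = missing_surfaces(explicit)
    if len(missing) == len(REQUIRED_SURFACES):
        # Only final-answer quality is offered as evidence.
        return "reject"
    if missing:
        # Partial evidence: useful, but rollout scope should be limited.
        return "require review"
    return "accept"
```

In a real review the list returned by `missing_surfaces` matters more than the verdict, because it names the artifact that should limit rollout scope.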

Implementation Notes

When implementing a new agentic feature, create the artifact skeleton first. Define the output schema before writing prompts. Define the trace events before adding tool calls. Define policy before exposing capabilities. Define eval fixtures before declaring the demo successful. This order keeps the system inspectable as it grows.
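
“Define the output schema before writing prompts” can start as a single validator. The field names below (`diagnosis`, `evidence`, `file`, `line`) are hypothetical and are not the lab’s actual schema. The sketch only shows that a schema written first gives every later eval something deterministic to check.

```python
import json
from typing import List


def validate_output(raw: str) -> List[str]:
    """Return a list of schema errors; an empty list means the output is valid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]

    errors: List[str] = []
    diagnosis = data.get("diagnosis")
    if not isinstance(diagnosis, str) or not diagnosis.strip():
        errors.append("diagnosis must be a non-empty string")

    evidence = data.get("evidence")
    if not isinstance(evidence, list) or not evidence:
        # A diagnosis without evidence is rejected by construction.
        errors.append("evidence must be a non-empty list")
    else:
        for index, item in enumerate(evidence):
            ok = (isinstance(item, dict)
                  and isinstance(item.get("file"), str)
                  and isinstance(item.get("line"), int))
            if not ok:
                errors.append(
                    f"evidence[{index}] needs a string file and an integer line")
    return errors
```

A deterministic stub and a later model-backed step can both be held to the same validator, which keeps the two directly comparable.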

In practice, the implementation order should feel conservative: workflow, tests, tools, policy, trace, eval, report, then model behavior. The model enters a system that already knows how to observe itself.

Extension Path

The next extension is not “add a smarter model.” It is to make the existing artifact graph stricter. Add a schema version to the final output, add trace completeness checks, and make the production report fail when required artifacts are missing. Those changes deepen the system contract without changing the model surface.
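
The three tightening steps named above can be sketched as one report function. Everything here is an assumption made for illustration: the artifact names, the required trace event types, the version string, and the status values are not taken from the lab repo.

```python
from typing import Dict, List

SCHEMA_VERSION = "1"  # hypothetical version marker
REQUIRED_ARTIFACTS = ("policy", "trace", "eval_result", "final_output")
REQUIRED_TRACE_EVENTS = ("run_started", "tool_call", "run_finished")


def production_report(artifacts: Dict[str, object]) -> Dict[str, object]:
    """Block the run when evidence is missing instead of assuming it is fine."""
    problems: List[str] = []

    for name in REQUIRED_ARTIFACTS:
        if artifacts.get(name) is None:
            problems.append(f"missing artifact: {name}")

    output = artifacts.get("final_output")
    version = output.get("schema_version") if isinstance(output, dict) else None
    if version != SCHEMA_VERSION:
        problems.append("final output schema_version missing or unexpected")

    trace = artifacts.get("trace")
    events = trace if isinstance(trace, list) else []
    seen = {e.get("type") for e in events if isinstance(e, dict)}
    for event_type in REQUIRED_TRACE_EVENTS:
        if event_type not in seen:
            problems.append(f"trace incomplete: no {event_type} event")

    status = "blocked" if problems else "ready_for_review"
    return {"status": status, "problems": problems}
```

The function never returns an approving status by default: an empty input produces a blocked report with one problem per missing piece of evidence.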

Only after that contract is stable should you add a model-backed diagnosis step. The model should enter as one strategy behind the same schema, tools, policy, traces, and evals. If adding the model requires weakening those artifacts, the system is not ready for the experiment.

Worked Scenario: A Pull Request Triage Bot

Imagine the local repo triage agent as a pull request bot. The weak version reads a diff, asks a model for comments, and posts whatever comes back. The stronger version has a workflow baseline, read-only repository tools, path policy, trace events, deterministic evals, context budgeting, and a report. The model may still produce the final diagnosis, but the system around it can be inspected.

In design review, the difference is stark. For the weak version, the reviewer asks “why did it make this comment?” and receives a prompt template. For the stronger version, the reviewer can inspect which files were read, what policy was active, what evidence entered context, whether the eval suite passed, and why the deployment gate allowed or blocked the run.

That is the central argument of the book in one scenario. A model can produce a useful sentence. A system produces a reviewable run.

Chapter Synthesis

The chapter’s core move is to replace vague agent language with operational surfaces. A system is not mature because it produces fluent language; it is mature when its decisions, authority, evidence, and gates can be inspected. The deterministic baseline is intentionally small because the first lesson is not intelligence. It is reviewability.

Carry this synthesis forward: every later chapter adds one surface to that reviewable system. Tools add capabilities. Policy constrains them. Traces record behavior. Evals measure contracts. Context profiling exposes inference pressure. Reports turn evidence into deployment decisions. The book is cumulative by design.

Evidence and References

Agent/workflow distinctions are grounded in Anthropic and OpenAI documentation (Anthropic 2024; OpenAI 2025). The repo behavior is grounded in the command above and its generated JSON.

Takeaways

  • Treat “agent” as a runtime claim, not a branding claim.
  • Require artifacts for authority, evidence, measurement, and rollout.
  • Start with the smallest system whose behavior can be falsified.

Exercises

  1. Classify five systems you know as model calls, workflows, tool-using systems, agents, or multi-agent systems. For each classification, name the runtime property that determined your answer: fixed path, dynamic tool selection, shared state, delegation, or multi-party coordination.
  2. Pick one existing model-call feature and sketch the smallest deterministic workflow that could replace it. Include inputs, outputs, validation, and failure handling.
  3. Identify the first boundary where dynamic decision-making creates value. Write the counterargument: why a fixed workflow might still be sufficient.
  4. Write one deployment gate for a tool-using system. Specify the artifact it reads, the condition it checks, and the status it emits when evidence is missing.
  5. Design the minimal trace schema for incident review of a one-step agent. Include at least run identity, input summary, tool calls, output validation, latency, and failure state.
  6. Write an eval case that would fail if the model produced a plausible answer without evidence. State which file, trace event, or output field proves grounding.
  7. Rewrite a vague product request such as “build an agent that reviews repos” into a concrete system contract with supported inputs, unsupported inputs, output schema, and rollout boundary.
  8. Compare the local repo triage system to one production agent you have seen. List which production concerns are absent from the lab and which abstractions still transfer.

Checklist

  • Do not equate a model call with an agentic system.
  • Start from deterministic behavior where possible.
  • Treat observability and evals as part of the system, not as later polish.
  • Define the output schema before prompt or model selection.
  • Require evidence for every user-facing diagnosis.
  • Make the first autonomy boundary explicit in design review.
  • Keep a deterministic baseline as the control group for experiments.
  • Treat deployment gates as software contracts, not review meeting opinions.