3 Workflows Before Agents
Learning Objective
After this chapter, you should be able to justify a deterministic workflow baseline before adding model autonomy.
Why This Matters
Autonomy is not free. It increases the behavioral surface: file selection, stopping criteria, context growth, malformed outputs, and policy violations all become runtime concerns. Anthropic’s guidance recommends workflows when the control path can be predefined and agents when the system needs flexible, model-directed behavior (Anthropic 2024). This lab turns that guidance into a reproducible baseline.
Core Concept
A workflow has a known path. The workflow in this repo:
- lists repository files,
- searches for domain terms,
- returns structured observations.
It does not infer intent, choose tools dynamically, or decide when to stop. That makes it easy to test and easy to falsify.
Why the Baseline Should Be Boring
A good baseline should feel almost disappointing. It should not be clever. It should not depend on hidden model behavior. It should make the task boundary so explicit that failure is easy to localize.
For the buggy_calc fixture, the deterministic workflow cannot write a rich bug report. It can list files and show where division-related text appears. That is enough to establish the evidence surface. The next chapter can then ask whether a tool-using runtime has a safe enough tool boundary. The agent chapter can ask whether structured diagnosis adds value. The eval chapter can ask whether the diagnosis is measurable.
This staged approach prevents a common error: using an LLM to blur requirements that have not been specified. If a system owner does not know what evidence is required, adding a model does not solve that problem. It usually makes the uncertainty harder to see.
Decision Boundary
Use a workflow when:
- the control path is known,
- the relevant data sources are known,
- the output schema is known,
- failures can be enumerated,
- deterministic tests can cover the behavior.
Consider an agent-shaped runtime when:
- the relevant data source is unknown at runtime,
- the next tool depends on an observation,
- the task requires iterative narrowing,
- the stopping condition depends on evidence,
- the value of dynamic selection outweighs the cost of variance.
These are design heuristics, not laws. The point is to make the autonomy decision explicit.
Baseline Comparison
The deterministic workflow should not be discarded after the agent exists. It remains a regression baseline. If the agent cannot outperform the workflow on a defined dimension, the workflow may be the better production implementation.
Useful comparison dimensions include:
- evidence coverage,
- output schema validity,
- latency,
- token estimate,
- policy violations,
- report status,
- operational simplicity.
This is where many agent demos become weaker under scrutiny. The agent may produce richer prose, but if it is slower, less grounded, harder to debug, and not measurably better, the deterministic path deserves priority.
Case Study Step
For buggy_calc, the deterministic workflow does not need to “understand” the repo. It only needs to expose the evidence surface. It proves that calculator.py, test_calculator.py, and README.md are visible under policy and that the division contract appears in the fixture.
That baseline gives the agent chapter a concrete question: can an agent-shaped runtime turn the same evidence into a structured diagnosis without losing traceability?
Designing the Baseline Artifact
A useful workflow baseline should return structured output, even if that structure is simple. In this repo, the baseline returns file names and grep matches. That is enough to compare against the agent. A richer production baseline might return ranked files, candidate evidence spans, or known-error signatures.
The baseline should also be cheap to run. If the baseline takes longer than the agent, costs more than the agent, or requires more operational complexity, it may not be a useful control. The local workflow is deliberately simple: one file listing and one grep. Its purpose is to make the evidence surface explicit, not to solve every triage problem.
When adapting this pattern, define the baseline artifact before writing the agent prompt. For a customer-support agent, the baseline might retrieve relevant knowledge-base articles and return their IDs. For a data-analysis agent, it might validate schema and compute summary statistics. For an infrastructure agent, it might collect logs and metrics without proposing remediation.
The agent then has to justify itself. It should improve diagnosis, prioritization, summarization, or decision quality in a way that can be measured against the baseline.
Anti-Pattern: Autonomy as a Requirement Shortcut
An agent is sometimes introduced because the product requirement is vague: “look around and figure it out.” That can be valid when the task genuinely involves open-ended investigation. It can also be a sign that the system boundary has not been designed.
Before accepting such a requirement, ask what a competent human would inspect first. If the answer is stable, encode that as a workflow. If the answer depends on observations, encode the decision points and trace them. The goal is not to eliminate agent behavior; the goal is to know where it starts.
Migration Path
The migration from workflow to agent should be incremental. Start with a fixed workflow. Add tracing. Add evals. Add policy. Add a structured output schema. Then replace one deterministic decision with a model decision. Keep the rest of the system unchanged.
That migration path gives you a clean comparison. If the model-backed file selector improves evidence coverage, the eval should show it. If it increases hallucinated files, the eval should show that too. If it increases context size, the report should show the cost. Without this staged migration, model changes and runtime changes become tangled.
In practice, this is also how you keep code review sane. Reviewers can inspect one change in authority at a time.
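One way to picture "replace one deterministic decision with a model decision" is to put that single decision behind a function parameter and leave everything else fixed. The sketch below is an illustration under stated assumptions, not the repo's code: `select_all`, `fake_model_selector`, and `run_with_selector` are hypothetical names, and the "model" is a hard-coded stub that deliberately proposes one file that does not exist.

```python
from typing import Callable, Dict, List, Sequence

# A selector receives the visible files and returns the ones to inspect.
FileSelector = Callable[[Sequence[str]], List[str]]


def select_all(files: Sequence[str]) -> List[str]:
    """Deterministic decision: inspect every visible file, in sorted order."""
    return sorted(files)


def fake_model_selector(files: Sequence[str]) -> List[str]:
    """Stand-in for a model decision (hypothetical stub, no model is called).

    It returns one real file and one invented file so the guard below
    has something to catch.
    """
    return ["calculator.py", "utils.py"]


def run_with_selector(
    files: Sequence[str],
    relevant: Sequence[str],
    select: FileSelector = select_all,
) -> Dict[str, object]:
    """Run the fixed pipeline with exactly one swappable decision."""
    visible = set(files)
    proposed = select(files)
    # Files the selector named that are not actually visible count as hallucinated.
    hallucinated = sorted(f for f in proposed if f not in visible)
    inspected = sorted(f for f in proposed if f in visible)
    covered = [f for f in relevant if f in inspected]
    coverage = len(covered) / len(relevant) if relevant else 1.0
    return {
        "inspected": inspected,
        "hallucinated": hallucinated,
        "evidence_coverage": coverage,
    }
```

Because only `select` changes between runs, any difference in `evidence_coverage` or `hallucinated` can be attributed to that one decision rather than to the surrounding runtime.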
Staff Practice Notes
When a team proposes an agent, ask them to write the workflow version in the same meeting. This is not a trick. It reveals whether the task is underspecified, whether autonomy is actually needed, and whether the team has a baseline for comparison. If the workflow is impossible to describe, the product requirement may be unclear. If the workflow is easy and sufficient, the agent may be unnecessary.
Do not confuse deterministic with simplistic. Many production workflows are sophisticated: retrieval, ranking, validation, routing, human approval, and model calls can all exist in a fixed path. The question is whether the path is chosen before execution. That distinction matters for testing, cost prediction, and incident response.
Operational Invariants
The workflow baseline should remain executable after the agent path exists. This is not nostalgia for simpler code; it is an operational control. The baseline gives CI a fast smoke test, gives incident response a fallback, and gives evals a comparison target when model behavior changes.
The agent path should name the dynamic decision it owns. “Use a model” is not specific enough. The model might select files, rank evidence, choose tools, decide whether to escalate, or summarize findings. Each dynamic decision has different failure modes and different eval requirements. If the team cannot name the decision, it cannot measure whether autonomy helped.
The workflow and agent should share deterministic infrastructure where possible: tools, policy, trace writer, schema validation, fixture repos, and report generation. Shared infrastructure makes differences easier to interpret. If both paths fail, inspect shared components. If only the agent fails, inspect the dynamic decision boundary.
The Lab
python scripts/run_all_examples.py --example workflow_baseline
Reading the Lab Output
The baseline returns a small JSON object. What matters is not only the list of matches but also the absence of hidden behavior. There is no model call, no implicit memory, no tool choice, and no mutation. If the output is wrong, the search space for debugging is small.
This makes the workflow a useful test harness. When the agent is introduced later, you can ask whether it preserved the same evidence while adding diagnosis. If the agent misses test_calculator.py, the workflow proves the file was available.
The output should also be read as a cost and latency baseline. A deterministic workflow has a known number of tool calls and no model invocation. When an agent path is added, every extra file read, retry, context segment, or model call should justify itself against this floor.
Code Walkthrough
run_workflow_baseline in scripts/run_all_examples.py creates a ToolPolicy, lists files under buggy_calc, and greps for divide|division. The policy still matters even though there is no agent. A deterministic workflow can still leak data if its tools are unsafe.
The function is intentionally procedural. It does not hide control flow behind an abstraction because the chapter is about seeing the fixed path. A reviewer can inspect the order: construct policy, list files, search content, build JSON, and print the artifact. There is no planning step and no hidden state.
This makes the baseline useful in two ways. First, it gives the reader a fast command that should keep working while the rest of the repo grows. Second, it demonstrates that safety boundaries are not exclusive to agents. Any code path that reads files needs confinement and output discipline.
When extending the repo, keep this baseline boring. If it starts accumulating dynamic behavior, it stops being a clean comparison. Add a separate agent path for runtime uncertainty; do not smuggle autonomy into the workflow and then compare the system against itself.
Expected Output
The expected file set is stable:
README.md
calculator.py
test_calculator.py
The expected matches point to the contract mismatch: documentation and tests mention division-by-zero behavior, while the implementation returns a / b.
The output also proves that the workflow has a bounded evidence path. If future code changes make the workflow inspect additional files, omit test_calculator.py, or return nondeterministic ordering, the baseline has changed. That change may be justified, but it should be reviewed as a behavior change.
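A small check can turn "the baseline has changed" into something CI reports. The sketch below assumes the artifact is a dictionary with a `files` list, which is an assumption about the output shape; `baseline_drift` and `is_deterministic` are hypothetical helper names.

```python
import json
from typing import Callable, Dict, List

# Expected file set from this chapter; the "files" key is an assumed artifact shape.
EXPECTED_FILES = ["README.md", "calculator.py", "test_calculator.py"]


def baseline_drift(artifact: Dict[str, object]) -> List[str]:
    """Return human-readable differences from the expected evidence path."""
    files = list(artifact.get("files", []))
    problems = []
    for name in sorted(set(EXPECTED_FILES) - set(files)):
        problems.append(f"missing file: {name}")
    for name in sorted(set(files) - set(EXPECTED_FILES)):
        problems.append(f"unexpected file: {name}")
    if files != sorted(files):
        problems.append("file order is not sorted")
    return problems


def is_deterministic(run: Callable[[], Dict[str, object]], repeats: int = 3) -> bool:
    """Run the workflow several times and compare the serialized artifacts."""
    outputs = {json.dumps(run(), sort_keys=True) for _ in range(repeats)}
    return len(outputs) == 1
```

An empty list from `baseline_drift` means the evidence path is unchanged; a non-empty list is a prompt for review, not necessarily a bug.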
Failure Mode
Overbuilding a known path as an agent makes the system harder to test, harder to debug, and harder to cost. This is the author's design judgment. The repo demonstrates the alternative: keep the deterministic floor and only add agent-shaped behavior where the runtime boundary is worth studying.
The symptom is architectural inflation. A task with a stable decision path gains planning loops, dynamic tool selection, broad prompts, and model-dependent control flow. The resulting system may feel modern, but it loses useful invariants: step count, evidence order, cost, latency, and failure localization. The team then has to evaluate autonomy that was not required for the task.
The root cause is skipping the baseline. Without a deterministic workflow, there is no control group. An agent is judged against the excitement of a blank page rather than against the simpler system that might have solved the same problem. This matters most when the task is operationally important, because unnecessary variance becomes a production burden.
The artifact that exposes the failure is a baseline comparison. If the workflow and agent inspect the same evidence, return the same diagnosis, and differ mostly in cost and variance, the workflow should remain the default. If the agent handles cases where the workflow cannot choose evidence or next steps, then the autonomy boundary is justified and testable.
Production Translation
In production, deterministic paths are easier to review, cache, monitor, and roll back. Agentic behavior should be introduced where the task contains genuine runtime uncertainty.
The production version of this lesson is architectural restraint. A fixed workflow can still be sophisticated: it can use retrieval, ranking, model calls, validators, typed schemas, and human review. The distinction is not “simple versus advanced.” The distinction is whether the path is known before execution.
When reviewing an agent proposal, ask for the deterministic baseline. If no one can describe it, the team may be using “agent” as a placeholder for unclear product behavior. If the baseline is strong, the agent should be evaluated against it rather than against an empty demo.
The rollout implication is simple: ship the workflow first when the workflow solves the job. Then run the agent in shadow mode against the same inputs. Compare task success, evidence quality, latency, cost, policy violations, and review burden. Only promote the agent when it wins on a metric that matters to the product. If it merely produces more elaborate prose, keep the workflow and spend engineering effort elsewhere.
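Shadow mode can be sketched as a loop in which the workflow's result is always the one served, while the agent runs on the same input purely for comparison. This is an illustrative sketch: `run_shadow` and the record fields are hypothetical, and `workflow` and `agent` are any callables you supply.

```python
import time
from typing import Any, Callable, Dict, Iterable, List, Tuple


def run_shadow(
    inputs: Iterable[Any],
    workflow: Callable[[Any], Any],
    agent: Callable[[Any], Any],
) -> Tuple[List[Any], List[Dict[str, Any]]]:
    """Serve workflow results; run the agent on the same inputs for comparison only."""
    served: List[Any] = []
    records: List[Dict[str, Any]] = []
    for item in inputs:
        start = time.perf_counter()
        primary = workflow(item)
        workflow_seconds = time.perf_counter() - start

        shadow, error = None, None
        start = time.perf_counter()
        try:
            shadow = agent(item)
        except Exception as exc:  # a shadow failure must never affect the served result
            error = repr(exc)
        agent_seconds = time.perf_counter() - start

        served.append(primary)
        records.append({
            "input": item,
            "agree": error is None and shadow == primary,
            "agent_error": error,
            "workflow_seconds": workflow_seconds,
            "agent_seconds": agent_seconds,
        })
    return served, records
```

The records, not the served results, are what feed the promotion decision: agreement rate, agent errors, and the latency gap are all visible before the agent ever owns a response.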
Design Review Questions
Before replacing a workflow with an agent, ask:
- What exact workflow exists today?
- Which step fails because the path is unknown?
- What new decision will the model make?
- What evidence proves that decision was better?
- What variance does the model introduce?
- What deterministic fallback remains?
- How will cost and latency compare to the workflow?
An agent proposal should identify the first dynamic decision, not simply assert that the whole task is dynamic.
Review Rubric
Reject an agent proposal that cannot describe the deterministic baseline. If the team cannot write the fixed path, it cannot prove that dynamic behavior is necessary.
Require review when the agent adds dynamic behavior but no comparison metrics. The proposal should state which metric improves: evidence coverage, task success, latency, cost, review time, or failure handling.
Accept the agent path only when the first dynamic decision is named and the workflow remains available as baseline, fallback, and regression harness.
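The rubric above is small enough to write down as a function, which makes the order of the checks explicit. The sketch is illustrative: `AgentProposal` and its field names are hypothetical, and treating "metrics present but no named decision" as needing review is an assumption, since the rubric does not cover that case directly.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AgentProposal:
    """Hypothetical proposal record; field names are illustrative."""
    baseline_described: bool
    first_dynamic_decision: str = ""
    comparison_metrics: List[str] = field(default_factory=list)
    workflow_kept_callable: bool = False


def review(proposal: AgentProposal) -> str:
    """Apply the rubric in order: reject, then require review, then accept."""
    if not proposal.baseline_described:
        return "reject"
    if not proposal.comparison_metrics:
        return "needs_review"
    if proposal.first_dynamic_decision and proposal.workflow_kept_callable:
        return "accept"
    # Assumption: anything short of the acceptance conditions goes back for review.
    return "needs_review"
```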
Implementation Notes
Keep the workflow callable even after the agent exists. It can run in CI as a smoke test. It can provide fallback behavior when model access is unavailable. It can produce diagnostic artifacts when the agent fails. Removing the baseline too early makes future regressions harder to localize.
If the workflow and agent share tools, the tool tests become more valuable. A bug in read_file affects both paths. A policy regression affects both paths. Shared deterministic components are leverage.
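Keeping the workflow callable makes fallback a few lines of code. The sketch below is one possible shape, not the repo's implementation: `diagnose_with_fallback` and the `source` and `agent_error` keys are hypothetical, and both paths are assumed to return dictionaries.

```python
from typing import Any, Callable, Dict


def diagnose_with_fallback(
    item: Any,
    agent: Callable[[Any], Dict[str, Any]],
    workflow: Callable[[Any], Dict[str, Any]],
) -> Dict[str, Any]:
    """Try the agent path; on any failure, return the workflow artifact instead."""
    try:
        result = dict(agent(item))
        result["source"] = "agent"
        return result
    except Exception as exc:
        # Record why the agent path failed so the fallback is visible in the artifact.
        result = dict(workflow(item))
        result["source"] = "workflow_fallback"
        result["agent_error"] = repr(exc)
        return result
```

Tagging the artifact with its source keeps the fallback explicit: a downstream report can count how often the agent path was actually used.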
Extension Path
A useful extension is a workflow-to-agent comparator. Run the deterministic workflow and the agent on the same fixture, then compare inspected files, finding count, trace event count, estimated context, latency, and report status. This turns the baseline into a measurable control group.
Do not start by comparing prose. Start by comparing artifacts. If the agent inspects more relevant evidence and preserves policy, the prose can be evaluated afterward. If the artifacts are worse, fluent prose should not rescue the design.
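A comparator along these lines can start as a plain diff over run summaries. The sketch below is a minimal illustration: `RunArtifact`, its fields, and `compare_runs` are hypothetical names chosen to mirror the dimensions listed above, and how each value is collected is left to the surrounding runtime.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RunArtifact:
    """Hypothetical summary of one run; fields mirror the dimensions in the text."""
    inspected_files: List[str]
    finding_count: int
    trace_event_count: int
    estimated_context_tokens: int
    latency_seconds: float
    report_status: str


def compare_runs(workflow: RunArtifact, agent: RunArtifact) -> Dict[str, object]:
    """Compare artifacts only; prose quality is deliberately out of scope here."""
    wf_files = set(workflow.inspected_files)
    agent_files = set(agent.inspected_files)
    return {
        "files_missed_by_agent": sorted(wf_files - agent_files),
        "files_added_by_agent": sorted(agent_files - wf_files),
        "finding_count_delta": agent.finding_count - workflow.finding_count,
        "trace_event_delta": agent.trace_event_count - workflow.trace_event_count,
        "context_token_delta": agent.estimated_context_tokens - workflow.estimated_context_tokens,
        "latency_delta_seconds": agent.latency_seconds - workflow.latency_seconds,
        "status_changed": workflow.report_status != agent.report_status,
    }
```

A non-empty `files_missed_by_agent` is the first thing to look at: the workflow shows those files were available, so the gap belongs to the agent's selection.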
Worked Scenario: When the Workflow Wins
Suppose the task is to detect whether a repository contains a division-by-zero contract mismatch. If the organization has a known signature, a deterministic workflow may be enough: inspect relevant files, search for expected phrases, and emit a structured finding. An agent may add prose, but it may not add operational value.
Now suppose the task changes: “Find why the latest CI run failed across an unfamiliar repository.” The relevant files might include tests, configuration, logs, or recent commits. The next step may depend on the first observation. This is a stronger case for agent-shaped behavior. The key is that the justification is specific. The agent is needed because file and evidence selection are uncertain at runtime, not because “agents are the future.”
That level of specificity keeps autonomy honest.
Chapter Synthesis
The workflow baseline is the book’s control group. It prevents the agent discussion from floating above engineering reality. If a fixed path solves the task, the fixed path should be respected. If a dynamic path is needed, the baseline clarifies exactly where the dynamic boundary begins.
This is also a cultural lesson. Teams often adopt autonomy because requirements feel unclear. A staff-level review reverses that pressure: clarify the deterministic path first, then add autonomy only where uncertainty remains. That discipline makes later evals and rollout decisions much sharper.
Evidence and References
The workflow/agent distinction is cited from Anthropic’s engineering guidance (Anthropic 2024). The local evidence is the workflow_baseline output.
Takeaways
- A workflow is the control group for autonomy.
- Dynamic behavior should own a named decision, not the whole task by default.
- Keep the deterministic path callable for fallback, CI, and comparison.
Exercises
- Convert one agent idea into a deterministic workflow. Write the workflow as ordered steps and mark the first step where a model would add value.
- Identify the first step that cannot be hard-coded cleanly. Explain whether the uncertainty is about input selection, tool selection, interpretation, prioritization, or action.
- Add a success criterion before adding a model decision. The criterion should be measurable from an artifact rather than a human impression.
- Decide which workflow observations should be traced. Separate evidence needed for debugging from evidence needed for compliance or release review.
- Implement a second workflow baseline for a new fixture repo. Keep it deterministic and add one test that compares expected output to actual output.
- Write a regression test that would fail if the agent path stopped matching the workflow baseline on buggy_calc.
- Design an escalation rule: when should the workflow hand off to an agent, and when should the agent fall back to the workflow?
- Estimate the operational cost of the workflow and the agent version in latency, context size, tool calls, and review burden.
Checklist
- Use workflows as the control group.
- Do not add autonomy to hide unclear requirements.
- Keep policy around deterministic tools too.
- Preserve the workflow path after the agent path exists.
- Evaluate the agent against a baseline, not against an empty demo.
- Require a concrete dynamic decision before accepting agent complexity.
- Keep shared tools tested independently from orchestration.
- Make fallback behavior explicit before deployment.