15 Production Readiness
Learning Objective
After this chapter, you should be able to turn a local agent run into deployment gates and review criteria.
Why This Matters
Production readiness is not a feeling. NIST AI RMF frames AI risk management around governance, mapping, measurement, and management functions (National Institute of Standards and Technology 2023). This repo keeps the scope smaller: it creates a local readiness report that links trace, eval, policy, and context signals.
Core Concept
Deployment gates should cover:
- tool safety,
- state and durability,
- observability,
- evals,
- guardrails,
- human approval,
- secrets,
- auth,
- rate limits,
- cost caps,
- rollback,
- incident review.
Gate Design
A gate should have four properties:
- It is measurable.
- It has an owner.
- It has a failure action.
- It is tied to an artifact.
For example, “evals should pass” is incomplete. “The default eval suite must pass in CI; failures block release and page the runtime owner” is a gate. “Context should be reasonable” is vague. “Dynamic context above a threshold sets deployment status to human review required” is a gate.
The current sample report demonstrates the second pattern. Evals pass, but large dynamic context still triggers review.
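The four properties can be written down as a small data structure. This is a minimal sketch, not the repo's implementation: the `Gate` class, the `evaluate` helper, the owner names, the evidence keys, and the 20,000-character threshold are all hypothetical and chosen only to mirror the two example gates above.

```python
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass(frozen=True)
class Gate:
    name: str
    check: Callable[[Mapping[str, float]], bool]  # measurable: returns pass or fail
    owner: str           # who acts when the gate fails
    failure_action: str  # what happens when the gate fails
    artifact: str        # where the evidence lives


def evaluate(gates: list[Gate], evidence: Mapping[str, float]) -> list[str]:
    """Return one line per failed gate, naming action, owner, and artifact."""
    return [
        f"{g.name}: {g.failure_action} (owner: {g.owner}, artifact: {g.artifact})"
        for g in gates
        if not g.check(evidence)
    ]


# Hypothetical gates. Missing evidence keys fall back to failing defaults.
GATES = [
    Gate(
        name="default_eval_suite",
        check=lambda e: e.get("evals_failed", 1) == 0,
        owner="runtime-owner",
        failure_action="block release",
        artifact="eval report",
    ),
    Gate(
        name="dynamic_context_budget",
        check=lambda e: e.get("dynamic_context_chars", float("inf")) <= 20_000,
        owner="ml-infra",
        failure_action="human review required",
        artifact="context summary",
    ),
]

# Evals pass, but dynamic context exceeds the assumed threshold.
failures = evaluate(GATES, {"evals_failed": 0, "dynamic_context_chars": 35_000})
for line in failures:
    print(line)
```

With this evidence only the context gate fails, which matches the second pattern: a passing eval suite does not clear the run for deployment on its own.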
Rollback Discipline
Agentic systems can change through prompts, models, tool definitions, policy, retrieval indexes, eval sets, and code. Rollback needs to know which layer changed. A model rollback will not fix an unsafe tool. A prompt rollback will not restore a deleted file. A policy rollback can re-enable a risky capability if not reviewed.
For a production system, version these surfaces separately and record their versions in traces or reports.
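One way to record those versions is a single immutable record per run, which can then be compared against the last known-good run. The sketch below is illustrative: `SurfaceVersions`, `changed_surfaces`, and the version strings are hypothetical names, not part of the repo.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class SurfaceVersions:
    """One version identifier per independently changing surface."""
    prompt: str
    model: str
    tool_definitions: str
    policy: str
    retrieval_index: str
    eval_set: str
    code: str


def changed_surfaces(last_good: SurfaceVersions, current: SurfaceVersions) -> list[str]:
    """List the surfaces whose version differs from the last good run."""
    before, after = asdict(last_good), asdict(current)
    return [name for name in before if before[name] != after[name]]


# Hypothetical version identifiers for illustration only.
last_good = SurfaceVersions(
    prompt="prompt-v12",
    model="model-2024-06",
    tool_definitions="tools-v7",
    policy="policy-v3",
    retrieval_index="index-0412",
    eval_set="evals-v5",
    code="a1b2c3d",
)
current = replace(last_good, tool_definitions="tools-v8", policy="policy-v4")

print(changed_surfaces(last_good, current))
```

Here the diff points at tool definitions and policy, so rolling back the model or the prompt would not address the change.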
Incident Questions
After a bad run, the incident review should ask:
- What did the user request?
- What tools were available?
- What tools were called?
- What evidence entered context?
- What eval would have caught this?
- What policy would have blocked it?
- What version changed most recently?
- What rollback is available?
If the current artifacts cannot answer those questions, the readiness work is incomplete.
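That completeness test can be made mechanical by mapping each question to the artifact field expected to answer it. The field names below are assumptions made for this sketch; the repo's trace and report schemas may use different names.

```python
# Hypothetical mapping from each incident question to the artifact field
# that would answer it.
INCIDENT_QUESTIONS = {
    "What did the user request?": "user_request",
    "What tools were available?": "tools_available",
    "What tools were called?": "tools_called",
    "What evidence entered context?": "context_sources",
    "What eval would have caught this?": "related_evals",
    "What policy would have blocked it?": "policy_decisions",
    "What version changed most recently?": "surface_versions",
    "What rollback is available?": "rollback_options",
}


def unanswered(artifacts: dict) -> list[str]:
    """Return the questions the collected artifacts cannot answer.

    An empty list is a valid answer (for example, no tools were called);
    only a missing or None field counts as unanswered.
    """
    return [
        question
        for question, field in INCIDENT_QUESTIONS.items()
        if artifacts.get(field) is None
    ]


# A run with a trace but no eval, policy, version, or rollback records.
artifacts = {
    "user_request": "diagnose failing test",
    "tools_available": ["read_file", "list_dir"],
    "tools_called": [],
    "context_sources": ["README.md"],
}

missing = unanswered(artifacts)
for question in missing:
    print("cannot answer:", question)
```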
Case Study Step
The current sample report is not a production approval. It is a review packet. It says that deterministic evals pass, policy violations are absent, and context growth still requires human review. That is a more useful state than a binary “works” or “does not work.”
Readiness Matrix
A practical readiness review can use a matrix:
| Area | Evidence | Local Artifact |
|---|---|---|
| Tools | path and shell policy | ToolPolicy, tests |
| Observability | run reconstruction | JSONL trace |
| Evals | deterministic gates | eval report |
| Context | growth and warnings | context summary |
| Safety | prompt-injection fixture | eval and policy |
| Deployment | review status | production report |
The matrix is incomplete for real deployment, but it is enough to structure a review. Missing cells are not failures; they are work items.
Ownership
Every gate needs an owner. Tool policy may belong to platform or security. Eval fixtures may belong to the product/runtime team. Context budgets may belong to ML infrastructure. Deployment approval may belong to an engineering lead. If ownership is unclear, failures become meetings instead of actions.
This book cannot assign ownership for your organization. It can make the ownership surfaces visible.
Readiness Is Continuous
Readiness is not a one-time launch checklist. Tool surfaces change. Models change. Prompt layouts change. Eval suites grow. User behavior shifts. A system that was acceptable for a limited rollout may be unacceptable after adding write tools or increasing traffic.
The report should therefore be regenerated as part of change review. If the report is only created once, it becomes documentation. If it is regenerated on every meaningful change, it becomes a gate.
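One way to enforce regeneration is to store a fingerprint of the versioned surfaces in the report and reject the report when the fingerprint no longer matches. This is a sketch under assumptions: the `surface_fingerprint` field and both helper functions are hypothetical and do not describe the current report format.

```python
import hashlib
import json


def fingerprint(surfaces: dict[str, str]) -> str:
    """Hash the surface versions in a key-order-independent way."""
    canonical = json.dumps(surfaces, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def report_is_current(report: dict, surfaces: dict[str, str]) -> bool:
    """A report without a fingerprint, or with a stale one, is not current."""
    return report.get("surface_fingerprint") == fingerprint(surfaces)


# Hypothetical surface versions at the time the report was generated.
surfaces = {"prompt": "v12", "model": "2024-06", "tools": "v7", "policy": "v3"}
report = {"status": "human_review_required", "surface_fingerprint": fingerprint(surfaces)}

print(report_is_current(report, surfaces))  # unchanged surfaces

# Adding a write tool changes the tool surface, so the old report is stale.
after_change = {**surfaces, "tools": "v8"}
print(report_is_current(report, after_change))
```

A CI step could fail when the second check returns `False`, which is what turns the report from documentation into a gate.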
Staff Practice Notes
Production readiness is not a mood. It is a set of gates with owners, artifacts, and rollback paths. If the conversation relies on confidence, ask for the report. If the report cannot answer the question, add a gate or narrow the rollout.
A staff engineer should distinguish “ready for local use,” “ready for shadow mode,” “ready for advisory comments,” and “ready for mutation.” These are different claims. Most systems should progress through them deliberately, and each transition should be supported by fresh evidence.
Operational Invariants
Readiness status should be conservative. Missing trace, missing eval, missing policy, missing report, or unbounded context should not silently become “ready.” The system can still be useful locally, but deployment authority should match evidence.
Readiness should be scoped by rollout mode. Local advisory use, shadow mode, automatic comments, and mutating actions have different gates. A system can be ready for one mode and blocked for another. Treating readiness as a single global boolean hides that distinction.
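Scoping by rollout mode can be sketched as a mapping from each mode to the evidence it requires. The evidence labels and the assignment of requirements to modes are assumptions for illustration; a real team would set them during review.

```python
from enum import Enum


class Mode(Enum):
    LOCAL_ADVISORY = "local_advisory"
    SHADOW = "shadow"
    AUTOMATIC_COMMENTS = "automatic_comments"
    MUTATING = "mutating"


# Hypothetical evidence requirements per rollout mode.
REQUIRED = {
    Mode.LOCAL_ADVISORY: {"trace"},
    Mode.SHADOW: {"trace", "evals_pass"},
    Mode.AUTOMATIC_COMMENTS: {"trace", "evals_pass", "policy_clean", "context_bounded"},
    Mode.MUTATING: {
        "trace", "evals_pass", "policy_clean", "context_bounded", "rollback_tested",
    },
}


def readiness_by_mode(evidence: set[str]) -> dict[Mode, list[str]]:
    """Map each mode to the evidence it still lacks; an empty list means permitted."""
    return {mode: sorted(required - evidence) for mode, required in REQUIRED.items()}


# Evals pass and policy is clean, but context is not yet bounded.
result = readiness_by_mode({"trace", "evals_pass", "policy_clean"})
for mode, gaps in result.items():
    print(mode.value, "permitted" if not gaps else f"blocked by {gaps}")
```

The same evidence set permits local advisory use and shadow mode while blocking automatic comments and mutation, so no single boolean could describe it.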
Readiness should include rollback. If the team cannot disable a tool, revert a prompt, switch to a workflow fallback, or stop user-visible automation, the system is not production-ready even if it passes task evals.
The Lab
python -m agentic_systems_lab.report
Reading the Lab Output
The command writes the report path. The report’s deployment recommendation should be read as a gate, not a suggestion. If the status requires human review, the reviewer should inspect why. In the current sample, the reason is large dynamic context.
A production version should make that status machine-readable and tie it to ownership.
When reading the recommendation, ask what deployment mode it permits. A warning may still allow local advisory use while blocking automatic comments. A failed eval may block all rollout. A missing trace may permit no claim at all. Readiness is scoped, not absolute.
Code Walkthrough
The report generator produces a deployment recommendation. It requires human review when evals fail, policy violations exist, or large dynamic outputs appear.
That rule is intentionally stricter than task success. A system can diagnose buggy_calc correctly and still be unready for automatic deployment because a different fixture produces context warnings or because a policy violation appeared. Readiness is a system property across tasks, not a single successful run.
The current implementation expresses recommendation as prose. A production version should encode a status enum and reason list, then render prose from that data. That gives CI a stable gate while giving humans a readable report. It also makes missing evidence easier to treat conservatively.
Readiness logic should be tested with synthetic evidence bundles. One test should pass all gates. One should require human review due to warning. One should block due to failed eval or policy violation. This is the same TDD habit applied to deployment policy.
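A status enum with reason codes, plus the synthetic evidence bundles described above, might look like the following. This is a sketch of the proposed production version, not the current report generator: the `Evidence` fields, the reason codes, and the choice to block on missing trace, eval, or policy evidence are assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    READY = "ready"
    HUMAN_REVIEW_REQUIRED = "human_review_required"
    BLOCKED = "blocked"


@dataclass
class Evidence:
    # None means the evidence is missing, which is treated conservatively.
    evals_passed: Optional[bool] = None
    policy_violations: Optional[int] = None
    context_warnings: Optional[int] = None
    trace_present: bool = False


def recommend(ev: Evidence) -> tuple[Status, list[str]]:
    """Derive a status and reason codes; missing evidence never yields READY."""
    blocking: list[str] = []
    review: list[str] = []
    if not ev.trace_present:
        blocking.append("missing_trace")
    if ev.evals_passed is None:
        blocking.append("missing_eval")
    elif not ev.evals_passed:
        blocking.append("failed_eval")
    if ev.policy_violations is None:
        blocking.append("missing_policy")
    elif ev.policy_violations > 0:
        blocking.append("policy_violation")
    if ev.context_warnings is None:
        review.append("missing_context_summary")
    elif ev.context_warnings > 0:
        review.append("large_dynamic_context")
    if blocking:
        return Status.BLOCKED, blocking + review
    if review:
        return Status.HUMAN_REVIEW_REQUIRED, review
    return Status.READY, []


def render(status: Status, reasons: list[str]) -> str:
    """Render prose for humans from the same data CI reads."""
    if status is Status.READY:
        return "Deployment recommendation: ready."
    label = status.value.replace("_", " ")
    return f"Deployment recommendation: {label} ({', '.join(reasons)})."


# Synthetic evidence bundles: clean pass, warning, and failed eval.
clean = Evidence(evals_passed=True, policy_violations=0, context_warnings=0, trace_present=True)
warned = Evidence(evals_passed=True, policy_violations=0, context_warnings=2, trace_present=True)
failed = Evidence(evals_passed=False, policy_violations=0, context_warnings=0, trace_present=True)

for bundle in (clean, warned, failed, Evidence()):
    print(render(*recommend(bundle)))
```

The empty `Evidence()` bundle is the conservative case: with nothing recorded, the status is blocked and every missing piece appears as a reason code.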
Expected Output
reports/sample_production_report.md includes a trace summary, all three eval tasks, context growth, production warnings, and a recommendation that human review is required before deployment.
The expected output is a scoped readiness statement. It does not say the system is useless; it says broader deployment needs review. A staff-level rollout should preserve that nuance rather than forcing all outcomes into pass/fail.
Failure Mode
A system with no eval gate, no trace, and no rollback path can work in a demo and still be unsupportable. That claim is the author’s judgment, but the repo demonstrates the evidence artifacts that reduce the ambiguity.
The symptom is a launch conversation driven by confidence rather than artifacts. The team has a strong demo, a few successful manual runs, and no clear answer to what blocks rollout, what requires human review, what changed since the last good run, or how to disable the system safely.
The root cause is treating production readiness as a final checklist instead of a system property. Readiness emerges from the interaction of evals, traces, policy, context budgets, reports, ownership, rollout scope, monitoring, and rollback. Missing any one of those does not always block local experimentation, but it should narrow deployment authority.
The artifact that exposes the failure is a production report with a conservative status. A report that can say “task passed, human review required because context budget warning exists” is more useful than a binary green check. Production readiness often means choosing the right rollout boundary, not declaring universal readiness.
Production Translation
For a real deployment, add ownership, monitoring, incident templates, escalation policy, data retention, and environment-specific security review. This book intentionally stops before cloud deployment.
The stopping point is deliberate. A local deterministic lab can teach the shape of readiness without pretending to solve auth, tenancy, secrets, compliance, or SRE operations. Those belong to the production environment.
The practical translation is a readiness matrix. Rows are gates: schema, trace, evals, policy, context, report, monitoring, rollback, ownership. Columns are deployment stages: local, CI, shadow, advisory, automatic, mutating. Each cell states required evidence. This matrix prevents a common failure: moving from a local proof of concept directly to user-visible automation without noticing which gates never existed.
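The stage-skipping failure can be checked directly once the matrix records when each gate first applies. The sketch below reduces each cell to "required or not" rather than a statement of evidence, and the stage at which each gate first becomes required is a hypothetical assignment.

```python
# Deployment stages in rollout order, as listed in the text.
STAGES = ["local", "CI", "shadow", "advisory", "automatic", "mutating"]

# Hypothetical: the earliest stage at which each gate is required.
FIRST_REQUIRED_AT = {
    "schema": "local",
    "trace": "local",
    "evals": "CI",
    "policy": "CI",
    "context": "shadow",
    "report": "shadow",
    "monitoring": "shadow",
    "ownership": "advisory",
    "rollback": "automatic",
}


def required_at(stage: str) -> set[str]:
    """Gates required at a stage: everything introduced at or before it."""
    limit = STAGES.index(stage)
    return {
        gate
        for gate, first in FIRST_REQUIRED_AT.items()
        if STAGES.index(first) <= limit
    }


def gates_skipped(from_stage: str, to_stage: str) -> list[str]:
    """Gates that become required between two stages."""
    return sorted(required_at(to_stage) - required_at(from_stage))


# Jumping from a local proof of concept straight to automatic actions.
skipped = gates_skipped("local", "automatic")
print(skipped)
```

Under these assumptions, the jump from local to automatic skips seven gates that the local proof of concept never had to satisfy.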
Design Review Questions
For production readiness, ask:
- What gates block deployment?
- What warnings require human review?
- What artifacts prove each gate?
- Who owns each gate?
- What changed since the last passing report?
- What rollback path exists?
- What incident questions cannot yet be answered?
- What monitoring is needed after rollout?
Readiness is a living system, not a slide.
Review Rubric
Reject readiness claims that lack rollback, ownership, trace evidence, or eval gates. A demo can work while the system remains unsupportable.
Require review when the system is ready for local advisory use but not for automatic or mutating actions. The correct response is scoped rollout, not binary approval.
Accept readiness when deployment status is data, reason codes are explicit, warnings have owners, rollback is tested, and each rollout stage has evidence requirements.
Implementation Notes
The next readiness improvement is to encode deployment status as data. Instead of a prose-only “human review required,” define a status enum and reason list. Reports can still render prose, but CI can read the enum.
The enum should be conservative. Missing evidence should not become success by default.
Extension Path
Encode deployment recommendation as data. Define a status enum such as ready, human_review_required, and blocked, plus reason codes. Then render Markdown from that data. Tests should cover missing trace, failed eval, policy violation, large context warning, and clean pass.
After status is data, add a readiness matrix by rollout stage. The same evidence bundle can permit local advisory use while blocking automatic mutation. That scoped decision is more realistic than a single global launch flag.
Worked Scenario: Release Review
Imagine a release review for the triage bot. The eval report passes. The trace is present. Policy is read-only. Prompt injection is detected without policy violations. Context growth still triggers human review.
The release decision should not be “ship” or “do not ship” by instinct. It should be scoped. Maybe the system can run as a local advisory tool but not post PR comments automatically. Maybe it can inspect small repos but not logs. Maybe it can run in shadow mode. Production readiness is often about choosing the right rollout boundary.
Chapter Synthesis
Production readiness combines the book’s surfaces into a deployment decision. Evals, traces, policy, context, reports, ownership, and rollback all matter because real systems fail across boundaries. A correct answer is necessary but not sufficient.
The chapter’s practical contribution is scoped readiness. A system can be ready for local advisory use and not ready for automatic mutation. It can be ready for shadow mode and not ready for broad rollout. Mature deployment decisions preserve those distinctions.
Evidence and References
The risk-management framing follows the NIST AI RMF (National Institute of Standards and Technology 2023). Claims about local readiness behavior are based on the report the repo generates.
Takeaways
- Readiness is scoped by rollout mode and supported by artifacts.
- Missing evidence should produce conservative status, not implicit approval.
- Rollback and ownership are part of the deployment contract.
Exercises
- Add a cost cap warning. Decide whether the cap is per run, per repository, per day, or per deployment stage.
- Add an incident template. Include run ID, trace path, policy decisions, eval status, context warnings, user impact, and rollback state.
- Add a rollback checklist. Distinguish disabling the agent, disabling a tool, reverting a prompt, reverting a model, and switching to workflow fallback.
- Decide which tools require human approval. Tie each approval requirement to authority, reversibility, and blast radius.
- Convert deployment recommendation into a status enum with reason codes. Add tests for pass, human review, and block.
- Define a shadow-mode rollout plan with entry criteria, monitoring signals, exit criteria, and owner sign-off.
- Write an on-call query list: the first ten questions an incident reviewer should be able to answer from artifacts.
- Compare readiness for local advisory use, automatic PR comments, and mutating code changes.
Checklist
- Deployment gates should be explicit.
- Reports should summarize evidence, not replace raw traces.
- Human review is a valid status.
- Missing evidence should default to conservative status.
- Rollout scope is part of readiness.
- Approval requirements should follow tool authority.
- Incident review should start from run artifacts.
- Production ownership must include eval, policy, and observability maintenance.