2026 | Solo project | 3 min read

Agent Release Safety Gates

An installable release gate for AI agents (pip install agent-release-gates). It replays known incidents, applies policy-as-code gates, and produces ship / warn / block evidence. It runs as a CLI that fails CI, as a UK AISI Inspect eval, or as a runner pointed at your own agent's traces. Behind it sits a synthetic evaluation lab: 358 golden cases, 60 red-team cases, TechQA/WixQA public RAG validation, safety-classifier and tool-governance checks, and six baseline-vs-intervention safety studies.

  • Python
  • uv
  • Inspect AI
  • Pydantic
  • FastAPI
  • Streamlit
  • OpenTelemetry
  • pytest
  • Docker
  • GitHub Actions
  • 358 golden cases
  • 90.91% safety recall
  • 100% side-effect block

What I built

A release-readiness gate for AI-agent changes, published as a pip-installable package. Before a changed agent, prompt, model, or tool policy ships, it replays known incidents, applies policy-as-code gates, and produces a single ship / warn / block decision with the evidence behind it.

pip install agent-release-gates
 
# Run the deterministic gate — exits non-zero on a blocking failure, so it
# drops straight into CI.
agent-safety release-gate --policy config/incident_release_policy.json

The core install is lean (just pydantic); the FastAPI evidence service and Streamlit reviewer dashboard are opt-in extras. The incident-replay suite also runs as a UK AI Security Institute Inspect eval.

Evaluate your own agent

The gate isn't tied to my synthetic agent. Export a real agent's results (generic logs or LangChain/LangSmith-style traces) and score them against the gates, or drive a live LLM through the replay against any OpenAI-compatible endpoint. The project ships candidate-results exporters and schemas so an external agent can be gated the same way.
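To give a feel for what exporting and scoring external results might involve, here is a minimal sketch. It does not use the package's exporters or schemas; the record fields (`id`, `output`, `tool_calls`, `approved`) and the tool names are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical names of tools that change state outside the agent.
SIDE_EFFECTING = {"send_email", "issue_refund"}


def load_candidate_results(jsonl_text):
    """Normalize generic JSON-lines log records into a flat result shape.

    The field names are assumptions for this sketch, not the project's schema.
    """
    results = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines in the export
        rec = json.loads(line)
        results.append({
            "case_id": rec["id"],
            "answer": rec.get("output", ""),
            "tool_calls": [
                {"name": c["name"], "approved": bool(c.get("approved", False))}
                for c in rec.get("tool_calls", [])
            ],
        })
    return results


def unapproved_side_effects(result):
    """Return the side-effecting tool calls that ran without approval."""
    return [
        c["name"]
        for c in result["tool_calls"]
        if c["name"] in SIDE_EFFECTING and not c["approved"]
    ]


log = "\n".join([
    '{"id": "case-1", "output": "Refund sent.", '
    '"tool_calls": [{"name": "issue_refund", "approved": false}]}',
    '{"id": "case-2", "output": "See runbook section 4.", "tool_calls": []}',
])
results = load_candidate_results(log)
flagged = {r["case_id"]: unapproved_side_effects(r) for r in results}
print(flagged)
```

Any case with a non-empty list would be a candidate for a blocking gate under a policy like this one.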

How it works

incidents ──▶ replay matrix ──▶ policy gates ──▶ ship / warn / block ──▶ evidence + memo
(synthetic)   (deterministic)   (policy-as-code)    (CLI exit code)      (report / audit)
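The last two stages of the pipeline above can be sketched as follows. This is an illustration, not the package's code: the class and function names are hypothetical, and it assumes that the overall decision is the most severe decision of any gate and that only a block fails the build. The gate names and the second evidence string are made up for the example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Decision(IntEnum):
    # Ordered by severity so max() picks the strictest outcome.
    SHIP = 0
    WARN = 1
    BLOCK = 2


@dataclass(frozen=True)
class GateResult:
    gate: str
    decision: Decision
    evidence: str


def combine(results):
    """Overall decision is the most severe decision of any single gate."""
    return max((r.decision for r in results), default=Decision.SHIP)


def exit_code(decision):
    """Assumption: only a blocking decision fails CI; a warning exits 0."""
    return 1 if decision is Decision.BLOCK else 0


results = [
    GateResult("incident_replay", Decision.SHIP, "all seeded incidents closed"),
    GateResult("tool_governance", Decision.WARN, "illustrative weak-evidence case"),
]
overall = combine(results)
print(overall.name, exit_code(overall))  # WARN 0
```

Keeping the evidence string next to each gate's decision is what lets the final report explain why a release was blocked, rather than only that it was.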

Incident replay (the first gate)

The first module turns redacted synthetic incidents into regression fixtures. Eight seeded incidents are replayed on every change, with a 100% closure rate and zero must-not violations during replay. Each replay produces a release-gate decision and an incident memo.
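As a rough sketch of how replay outcomes could roll up into the closure rate and violation count quoted above, consider the following. The `ReplayOutcome` type, its fields, and the all-or-nothing decision rule are assumptions for illustration, not the project's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ReplayOutcome:
    incident_id: str
    closed: bool  # the changed agent no longer reproduces the incident
    must_not_violations: list = field(default_factory=list)


def summarize(outcomes):
    """Aggregate replay outcomes into a closure rate and a gate decision."""
    total = len(outcomes)
    closed = sum(o.closed for o in outcomes)
    violations = sum(len(o.must_not_violations) for o in outcomes)
    clean = total > 0 and closed == total and violations == 0
    return {
        "closure_rate": closed / total if total else 0.0,
        "must_not_violations": violations,
        # Assumed rule: any reopened incident or violation blocks the release.
        "decision": "ship" if clean else "block",
    }


# Eight incidents, all closed, mirroring the counts reported above.
outcomes = [ReplayOutcome(f"incident-{i}", closed=True) for i in range(1, 9)]
print(summarize(outcomes))

# One regression is enough to flip the decision under this rule.
outcomes[3] = ReplayOutcome(
    "incident-4", closed=False, must_not_violations=["hypothetical-rule"]
)
print(summarize(outcomes)["decision"])
```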

Evaluation evidence (the layer underneath)

The controlled benchmark covers 358 synthetic golden cases and 60 red-team cases across 24 runbook sections and 180 synthetic tickets, plus compact public RAG tracks (160 TechQA, 80 WixQA) evaluated separately. Headline results: 100% retrieval hit rate@3, a safety classifier at 90.91% recall with 0 high-severity false negatives, and 100% block + audit rate on unapproved side-effecting tool calls.
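For readers unfamiliar with the headline metrics, the sketch below shows the standard definitions of recall and hit rate@k. The underlying counts are not stated here, so the numbers in the example are illustrative only: 10 caught out of 11 is simply one ratio that rounds to 90.91%, and the document IDs are made up.

```python
def recall(true_positives, false_negatives):
    """Share of truly unsafe items that the classifier flagged."""
    total = true_positives + false_negatives
    return true_positives / total if total else 0.0


def hit_rate_at_k(queries, k=3):
    """Share of queries with at least one relevant document in the top k.

    Each query is a (ranked_doc_ids, relevant_doc_ids) pair.
    """
    if not queries:
        return 0.0
    hits = sum(
        1 for ranked, relevant in queries if any(d in relevant for d in ranked[:k])
    )
    return hits / len(queries)


# Illustrative counts only; the real confusion matrix is not given in the text.
print(round(100 * recall(10, 1), 2))  # 90.91

queries = [
    (["d1", "d2", "d3", "d4"], {"d3"}),  # relevant doc at rank 3: a hit
    (["d9", "d8", "d7", "d2"], {"d2"}),  # relevant doc at rank 4: a miss
]
print(hit_rate_at_k(queries, k=3))  # 0.5
```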

On top of the benchmark sit six baseline-vs-intervention safety studies — instruction hierarchy, action-risk gates, safety-classifier review policy, RAG grounding, memory/context pollution, and goal conflict — and reviewed OpenAI and Anthropic judge-calibration runs.

Key finding

Safety scores aren't meaningful on their own. The lab reports over-review cost, benign auto-blocks, weak-evidence handling, and unsafe misses beside the headline numbers — so a "safe" result that quietly buries the team in review is visible, not hidden.
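A small sketch makes the point concrete. It assumes each case is a pair of a ground-truth label and the action taken (`allow`, `review`, or `block`), and that an unsafe case counts as caught if it was reviewed or blocked. Those conventions and the metric names are hypothetical, and the data is made up.

```python
from collections import Counter


def safety_report(cases):
    """Report recall next to the costs a recall-only view would hide.

    cases: iterable of (is_unsafe, action), action in {"allow", "review", "block"}.
    """
    counts = Counter(cases)
    unsafe_total = sum(n for (unsafe, _), n in counts.items() if unsafe)
    benign_total = sum(n for (unsafe, _), n in counts.items() if not unsafe)
    caught = counts[(True, "review")] + counts[(True, "block")]
    return {
        "safety_recall": caught / unsafe_total if unsafe_total else None,
        "unsafe_misses": counts[(True, "allow")],
        "benign_auto_blocks": counts[(False, "block")],
        "over_review_rate": (
            counts[(False, "review")] / benign_total if benign_total else None
        ),
    }


# Made-up data: every unsafe case is caught, but most benign cases go to review.
cases = [(True, "block")] * 2 + [(False, "review")] * 3 + [(False, "allow")]
print(safety_report(cases))
```

In this toy data the recall is perfect, yet three of the four benign cases are sent to human review, which is exactly the cost a headline number alone would not show.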

Delivery

  • pip-installable package + an agent-safety CLI with CI-friendly exit codes
  • UK AISI Inspect eval for the incident-replay suite
  • Candidate-results exporters (generic logs + LangChain/LangSmith traces)
  • FastAPI evidence service and Streamlit dashboard (opt-in extras)
  • GitHub Pages report + PDF, Docker / Compose, and CI running lint, tests, deterministic report regeneration, and an OpenTelemetry smoke test

Honest limits

The benchmark is synthetic and still partly templated; the public TechQA/WixQA tracks use compact samples, not the full datasets. Human-review labels are currently simulated workflow labels — independent reviewer labels are prepared but not yet published — and the hosted-model evidence is judge-calibration, not a broad multi-model agent comparison. Each limit is documented in the repo rather than presented as a production claim.