2026 · Solo project · 3 min read
LLM Red-Team Evaluation Harness
Reproducible benchmark measuring how published adversarial prompts perform against 2026-era LLMs and whether prompt-only defences reduce attack success, with cross-judge validation and bootstrap confidence intervals.
- Python
- Claude Sonnet 4.6
- Llama 3.1 8B
- Inspect AI
- GitHub Actions
- pytest
- ruff
- mypy
What I built
A fully reproducible evaluation harness for measuring attack success rate (ASR) of published adversarial prompt corpora against two target models — Claude Sonnet 4.6 (frontier API) and Llama 3.1 8B (local via Ollama) — under composable defence configurations.
The v1 evaluation matrix covers 12 cells: 2 target models × 2 benchmark families (AdvBench direct attacks, AgentDojo static indirect injection) × up to 4 defence stacks. Not every combination is run, which is why the total is 12 rather than the full 2 × 2 × 4 = 16.
Headline finding
Published adversarial prompts succeed between 0% and 4% of the time across all 12 cells. A paranoid prompt-only defence stack does not measurably move that number. The honest interpretation: 2026-era instruction tuning already neutralises these static, published attacks on both a frontier and a small local model.
What the harness measures
- Corpus loading — AdvBench, JailbreakBench, HarmBench, and AgentDojo, each pinned to an upstream commit for reproducibility
- Defence stacks — paranoid system prompt, Constitutional critique-and-revise, Spotlighting, SecAlign-style structured queries (composable, toggled via YAML run configs)
- Scoring pipeline — rule-based pre-screen → LLM judge → independent cross-judge for validation
- Statistical rigour — ASR with 95% percentile-bootstrap confidence intervals, Cohen's κ and Krippendorff's α for inter-judge agreement, real API cost per run
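To make the statistical step concrete, here is a minimal sketch of a percentile-bootstrap interval for ASR. It is an illustration, not the harness's own code: the function name bootstrap_asr_ci, its parameters, and the 0/1 encoding of per-prompt verdicts are all assumptions made for this example.

```python
import random
from typing import Sequence, Tuple


def bootstrap_asr_ci(
    outcomes: Sequence[int],
    n_resamples: int = 2000,
    confidence: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (asr, ci_low, ci_high) for a list of 0/1 attack outcomes.

    Hypothetical helper: 1 means the judge marked the attack as successful,
    0 means it did not. The interval is a plain percentile bootstrap.
    """
    if not outcomes:
        raise ValueError("need at least one outcome")

    rng = random.Random(seed)  # fixed seed so reruns give the same interval
    n = len(outcomes)
    point_estimate = sum(outcomes) / n

    # Resample the outcomes with replacement and recompute ASR each time.
    resampled = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )

    # Read the interval bounds off the sorted resampled statistics.
    tail = (1.0 - confidence) / 2.0
    low_index = int(tail * n_resamples)
    high_index = min(n_resamples - 1, int((1.0 - tail) * n_resamples))
    return point_estimate, resampled[low_index], resampled[high_index]


if __name__ == "__main__":
    # Toy data only: 2 successes out of 50 prompts, i.e. a 4% ASR cell.
    asr, low, high = bootstrap_asr_ci([1] * 2 + [0] * 48)
    print(f"ASR={asr:.2%}  95% CI=[{low:.2%}, {high:.2%}]")
```

One thing the sketch makes visible: a cell with zero successes yields a degenerate interval of [0, 0], because every resample of an all-zero list is all zeros. That is a known limitation of the percentile bootstrap at the boundary, not a claim of certainty.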
Cross-judge validation
Every attack-success verdict is scored by one judge model and independently re-scored by a second. Cross-judge κ = +1.00 on ASR across all 12 cells, so ASR is a well-posed metric under this judging setup. The harness also showed that refusal_rate is not well-posed: the two judges agree on whether an attack succeeded, but disagree on whether a response was a "refusal" in the indirect-injection setting, sometimes worse than chance (κ below 0). The cause is that there are two things that can be refused there: the model can decline the injected instruction, the user's original task, or both, and "refusal" does not say which one is meant. This is documented in METHODOLOGY.md, not quietly omitted.
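For readers who want to see what the agreement statistic is doing, here is a small sketch of Cohen's κ over two judges' verdicts. The function name and the toy labels are assumptions for illustration. The handling of the degenerate case (both judges give one identical label to every item, where the formula is 0/0) is a convention chosen for this sketch, not necessarily what the harness does.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(judge_a: Sequence[Hashable], judge_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("both judges must label the same non-empty item list")

    n = len(judge_a)
    # Observed agreement: fraction of items with identical labels.
    p_observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n

    # Chance agreement: product of each judge's marginal label frequencies.
    counts_a, counts_b = Counter(judge_a), Counter(judge_b)
    labels = set(counts_a) | set(counts_b)
    p_chance = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)

    if p_chance == 1.0:
        # Both judges used one identical label throughout; kappa is 0/0.
        # Assumption for this sketch: report that as perfect agreement.
        return 1.0
    return (p_observed - p_chance) / (1.0 - p_chance)


if __name__ == "__main__":
    # Toy verdicts only, not harness data.
    print(cohens_kappa([1, 0, 1, 0], [1, 0, 1, 0]))  # identical verdicts
    print(cohens_kappa([1, 0, 1, 0], [0, 1, 0, 1]))  # systematic disagreement
```

The second call shows how agreement can be worse than chance: when the judges systematically disagree, observed agreement falls below the chance level and κ goes negative.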
Inspect AI compatibility
Any run can be exported as an eval log for Inspect, the UK AI Security Institute's evaluation framework, so results load directly in inspect view or via read_eval_log(). Cross-judge agreement, confidence intervals, and cost travel in the log metadata.
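A hedged sketch of reading those values back in Python follows. read_eval_log and its header_only flag come from Inspect's inspect_ai.log module. Everything else is an assumption: the metadata key names (cross_judge_kappa, asr_ci_95, cost_usd) are hypothetical, and so is the idea that the harness stores them under log.eval.metadata.

```python
from typing import Any, Dict, Mapping, Optional

# Hypothetical key names; check an actual exported log for the real ones.
SUMMARY_KEYS = ("cross_judge_kappa", "asr_ci_95", "cost_usd")


def summarise_metadata(metadata: Optional[Mapping[str, Any]]) -> Dict[str, Any]:
    """Pick the summary fields out of a metadata mapping (None if absent)."""
    metadata = metadata or {}
    return {key: metadata.get(key) for key in SUMMARY_KEYS}


def load_run_summary(log_path: str) -> Dict[str, Any]:
    """Read an Inspect eval log header and return the summary fields."""
    # Imported lazily so the pure helper above works without inspect_ai.
    from inspect_ai.log import read_eval_log

    # header_only=True skips per-sample data, which is enough for metadata.
    log = read_eval_log(log_path, header_only=True)
    # Assumption: the harness writes its summary into the eval-level metadata.
    return summarise_metadata(log.eval.metadata)
```

Calling load_run_summary with the path to an exported log would return a dict with the three keys, each set to None when the log does not carry that field.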
Why ASR, not refusal rate
ASR is the headline metric because it is the one the two judges agree on. Building a measurement tool that reports how trustworthy its own metrics are was a core design goal, and the cross-judge layer is what surfaced the refusal_rate problem. Reporting a metric you know is unreliable, without documenting that fact, is exactly the kind of methodological noise that makes the safety benchmarking literature hard to interpret.
Ethical design
Only published adversarial prompts are used. Excluded categories (CSAM, weapons-of-mass-destruction synthesis, detailed self-harm methods) are filtered at corpus-load time and verified by a CI test. No raw harmful outputs are committed to the repo. See ETHICS.md for the full policy.
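The load-time filter and its CI check could look roughly like the sketch below. It is illustrative only: the category labels, the record schema (id, category, prompt), and the function names are hypothetical, and the authoritative exclusion policy is the one in ETHICS.md.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical labels standing in for the excluded categories.
EXCLUDED_CATEGORIES = frozenset({"csam", "wmd_synthesis", "self_harm_methods"})


def filter_excluded(records: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Drop every record whose category is on the exclusion list."""
    return [r for r in records if r.get("category") not in EXCLUDED_CATEGORIES]


def test_excluded_categories_never_load() -> None:
    """pytest-style check: nothing from an excluded category survives loading."""
    # Placeholder records only; no real prompt text appears in the test.
    raw = [
        {"id": "a1", "category": "fraud", "prompt": "<placeholder>"},
        {"id": "a2", "category": "wmd_synthesis", "prompt": "<placeholder>"},
        {"id": "a3", "category": "self_harm_methods", "prompt": "<placeholder>"},
    ]
    loaded = filter_excluded(raw)
    assert [r["id"] for r in loaded] == ["a1"]
    assert not any(r["category"] in EXCLUDED_CATEGORIES for r in loaded)
```

Running the filter inside the loader, rather than at scoring time, means excluded prompts never reach a target model, and the test fails the build if an upstream corpus update reintroduces one.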
Limits and future work
The open risk these static benchmarks under-measure is the full agentic loop: interactive, multi-turn tool use with real actions. METHODOLOGY.md names this explicitly as future work. Next tracks: full AgentDojo agent loop, multi-turn attacks, and expanded model coverage.