Selected work

Projects

A few representative builds. Filter by stack to narrow the list.

LLM Red-Team Evaluation Harness
2026
Reproducible benchmark measuring how published adversarial prompts perform against 2026-era LLMs and whether prompt-only defences make a measurable difference, with cross-judge validation and bootstrap confidence intervals.
- Python
- Claude Sonnet 4.6
- Llama 3.1 8B
- Inspect AI
- GitHub Actions
- pytest
- ruff
- mypy
View source
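
For readers unfamiliar with the two statistical checks named in the description, here is a minimal sketch of how they could be computed. It is not taken from the project's source: the function names `bootstrap_ci` and `cohen_kappa`, the 0/1 outcome encoding (1 = the attack succeeded, or the judge flagged the response), and the choice of a percentile bootstrap and Cohen's kappa are all assumptions made for illustration.

```python
import random
from typing import Sequence


def bootstrap_ci(
    outcomes: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (point estimate, lower, upper) for the mean of 0/1 outcomes.

    Hypothetical helper: uses a plain percentile bootstrap.
    """
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(outcomes)
    point = sum(outcomes) / n
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    # Percentile method: read off the alpha/2 and 1 - alpha/2 quantiles.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return point, means[lo_idx], means[hi_idx]


def cohen_kappa(judge_a: Sequence[int], judge_b: Sequence[int]) -> float:
    """Chance-corrected agreement between two binary judges (hypothetical helper)."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("judges must label the same non-empty set of items")
    n = len(judge_a)
    # Observed agreement: fraction of items where both judges gave the same label.
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # Expected agreement if the judges labelled independently at their own base rates.
    rate_a = sum(judge_a) / n
    rate_b = sum(judge_b) / n
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if expected == 1.0:
        return 1.0  # both judges gave one constant, identical label
    return (observed - expected) / (1 - expected)
```

Under these assumptions, `bootstrap_ci` would be called once per model and defence condition on the list of per-prompt outcomes, and `cohen_kappa` on the labels two judge models assign to the same responses. A low kappa would suggest the measured success rate depends heavily on which judge is used.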