Available for full-time roles from Oct 2026

[ Ross King ] Sheffield, UK

I work on AI evaluation and reliability.

MSc Artificial Intelligence candidate at the University of Sheffield. I build software for evaluating AI honestly: release gates for agents, benchmarks that can fail, and numbers you can trace back to the test that produced them. The same discipline runs through the data work — Spark-scale backfills, dbt warehouses and ML forecasts.

Roles →

AI Evaluation & Reliability · ML Engineering · Data Engineering

9 projects shipped
1,681 tests across them
6 live demos

[ 01 ] Featured work

Three projects, shown working

Numbered 01–03 by what I’d want you to read first — not by date. The screenshots are the actual apps; every card links to the evidence.

All 10 projects

05 · 2026 · 5 min read

Agent Release Safety Gates

An installable release gate for AI agents (pip install agent-release-gates): replay known incidents, apply policy-as-code gates, and produce ship / warn / block evidence, whether as a CLI that fails CI, a UK AISI Inspect eval, or a runner pointed at your own agent's traces. Its most useful result is a negative one: the project's own synthetic benchmark turned out to be circular by construction, and the same retriever scores about twenty points lower on 640 external public cases. The external number is the one reported.

79.92% external retrieval hit@3
640 public RAG cases
309 tests
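
For readers new to the metric: hit@3 is usually defined as the share of queries where at least one relevant document appears in the top three retrieved results. The sketch below assumes that common definition; the function names and toy cases are hypothetical and are not taken from agent-release-gates.

```python
def hit_at_k(ranked_ids, relevant_ids, k=3):
    """Return True if any relevant document appears in the top-k results."""
    return any(doc_id in relevant_ids for doc_id in ranked_ids[:k])


def hit_rate(cases, k=3):
    """Fraction of cases with at least one relevant document in the top k."""
    hits = sum(hit_at_k(ranked, relevant, k) for ranked, relevant in cases)
    return hits / len(cases)


# Hypothetical toy cases: (ranked retrieval output, set of relevant ids).
toy_cases = [
    (["d1", "d7", "d3"], {"d3"}),        # relevant doc at rank 3: hit
    (["d2", "d4", "d9", "d5"], {"d5"}),  # relevant doc only at rank 4: miss
    (["d8", "d6"], {"d8", "d6"}),        # relevant doc at rank 1: hit
    (["d0", "d1", "d2"], {"d9"}),        # relevant doc never retrieved: miss
]

print(f"hit@3 = {hit_rate(toy_cases):.2%}")
```

Scoring the same retriever on cases it was not built around is what separates an external figure from a circular one; the metric itself stays identical.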

Python · uv · Inspect AI · Pydantic

Read the write-up

Screenshot of the Agent Release Safety Gates live demo (live at agent-release-gates.streamlit.app)

run log

$ redteam-foundry run --matrix v1 # 12 evaluation cells

» attack success: 0–4% across all 12 cells

» positive control (llama2-uncensored): 80% [72, 87] — the pipeline can detect

» cross-judge κ = +0.935 on the control; undefined in 11 of 12 cells

» exports to a UK AISI Inspect eval log
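
A note on the "undefined" κ in the run log above: Cohen's kappa divides by (1 − p_e), so it has no value when both judges assign the same single label to every item, which is plausible when almost no attacks succeed. That reading is an assumption, not something the log states, and the function below is a generic textbook sketch rather than redteam-foundry's code.

```python
def cohen_kappa(judge_a, judge_b):
    """Cohen's kappa for two label sequences; None when it is undefined."""
    n = len(judge_a)
    labels = set(judge_a) | set(judge_b)
    # Observed agreement: share of items where both judges agree.
    p_o = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # Chance agreement: product of each judge's marginal label rates.
    p_e = sum((judge_a.count(l) / n) * (judge_b.count(l) / n) for l in labels)
    if abs(1.0 - p_e) < 1e-12:
        return None  # 0/0: both judges used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical verdicts: 1 = attack succeeded, 0 = attack refused.
all_refused = [0] * 10
print(cohen_kappa(all_refused, all_refused))        # no variance to agree on
print(cohen_kappa([1, 1, 0, 0], [1, 1, 0, 0]))      # perfect agreement
```

Returning None instead of 0 or 1 keeps "the judges had nothing to disagree about" distinct from "the judges agreed".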

06 · 2026 · 4 min read

redteam-foundry

Adversarial benchmark foundry for LLM safety (pip install redteam-foundry): 1,883 prompts from four pinned corpora across 12 evaluation cells, with bootstrap confidence intervals and per-run API cost. Attack success came out at 0–4% — a negative result, so a known-vulnerable control model was pushed through the identical pipeline and scored 80%, which is what makes the zero mean something.
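
As an illustration of how a bootstrap confidence interval on an attack success rate can be computed, here is a minimal percentile-bootstrap sketch using only the standard library. The resample count, seed, and toy outcomes are assumptions chosen for the example; the source does not say which bootstrap variant redteam-foundry uses.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of 0/1 outcomes.

    Returns (point_estimate, lower, upper).
    """
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement and record the success rate each time.
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(outcomes) / n, lower, upper


# Hypothetical outcomes: 1 = attack succeeded, 0 = attack refused.
control_like = [1] * 80 + [0] * 20
point, lower, upper = bootstrap_ci(control_like)
print(f"success rate {point:.0%}, 95% CI [{lower:.0%}, {upper:.0%}]")
```

An interval is what lets a low rate and a high control rate be compared as ranges rather than as two point estimates.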

Python · Claude Sonnet 4.6 · Llama 3.1 8B

01 · 2026 · 3 min read

London Cycle-Hire Analytics Platform

Answers one question well: when London's transport is disrupted, how much extra demand lands on the bikes, and where? A 41.4M-journey PySpark backfill unified across five drifting schema eras, a tested dbt star schema, a LightGBM station-level forecast, and a free always-on live layer refreshed daily by GitHub Actions into committed Parquet — no warehouse to keep alive. Headline: strike days run about 1.4× median demand, up to ~2.3× on the worst full-network strike day.

Python · PySpark · dbt

07 · 2026 · 3 min read

Cited Market Brief Agent

A region-aware morning-market web app and an audit-ready, evidence-backed brief engine in one. The radar surfaces a market clock, a FRED overnight-risk rail, most-read finance news with AI summaries, and a Taiwan ETF-vs-benchmark attribution tool; the brief engine generates company briefs from SEC EDGAR + FRED, attaching a stored source span to every claim it accepts, with a click-through evidence ledger. Four localised editions (Taiwan, Korea, UK, EU). The CI gate that once certified this at 100% turned out to be scoring itself; measured against independently labelled ground truth, 40% of accepted claims are genuinely supported by the span they cite.

TypeScript · React · FastAPI

04 · 2026 · 3 min read

Aerospace Prognostics

Deployable end-to-end PHM MLOps, not another leaderboard notebook: NASA C-MAPSS turbofan RUL and ESA spacecraft-telemetry anomaly detection carried through their real evaluation protocols, wrapped in a FastAPI serving API, an operator console, signed release evidence (model card, SBOM, provenance), drift monitoring, and 462 tests. The evaluation layer proved general enough to extract as telemeval, a standalone library on PyPI with a Zenodo DOI.

Python · FastAPI · Streamlit

01 · 2026 · 3 min read

London Cycle-Hire Analytics Platform

41.4M journeys unified
1.42× median station-day demand
92 dbt data tests

Python · PySpark · dbt · DuckDB

Read the write-up

02 · 2026 · 3 min read

England & Wales Housing Decision Support

Explainable where-to-live decision support for England & Wales. A tested dbt + DuckDB engine turns nine open-data sources into five transparent 0–100 indicators across 7,264 neighbourhoods, every score shown beside the raw figure it came from, served through a public FastAPI and a Next.js site with ~7k programmatic area pages. It is backed by 228 dbt data tests and 2 unit tests, a versioned cross-runtime scoring contract, a Dagster-orchestrated refresh, and published lineage docs.

dbt · DuckDB · Dagster

08 · 2026 · 3 min read

Responsible Neobank Growth

A synthetic neobank whose backend events misbehave on purpose — late, duplicated, reversed, schema-evolving — generated against a known-truth manifest so a governed dbt warehouse can be checked rather than trusted. On top sit the responsible-growth consumers: experimentation (CUPED, SRM, difference-in-differences, synthetic control), a calibrated activation model, and a release-gate that weighs customer-outcome guardrails. Run once on BigQuery: 68 dbt models under 217 data tests and 400 pytest tests, with full-refresh and incremental matching exactly at all six governed interfaces.
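
Of the experimentation methods named above, CUPED is the easiest to show in a few lines: it subtracts the part of a metric that a pre-experiment covariate already predicts, which lowers variance without shifting the mean. This is the standard textbook adjustment on hypothetical numbers, not code from the project; in a real experiment theta is normally estimated on the pooled data from all arms.

```python
from statistics import mean, pvariance


def cuped_adjust(metric, covariate):
    """Return CUPED-adjusted metric values.

    theta = cov(metric, covariate) / var(covariate)
    adjusted_i = metric_i - theta * (covariate_i - mean(covariate))
    """
    x_bar, y_bar = mean(covariate), mean(metric)
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(covariate, metric))
    var_x = sum((x - x_bar) ** 2 for x in covariate)
    theta = cov_xy / var_x if var_x else 0.0  # constant covariate: no adjustment
    return [y - theta * (x - x_bar) for x, y in zip(covariate, metric)]


# Hypothetical per-user values: pre-period spend and in-experiment spend.
pre_spend = [10, 12, 9, 15, 11, 14]
exp_spend = [11, 13, 9, 17, 12, 15]

adjusted = cuped_adjust(exp_spend, pre_spend)
print(f"variance before: {pvariance(exp_spend):.3f}")
print(f"variance after:  {pvariance(adjusted):.3f}")
```

The adjusted values keep the original mean, so a treatment effect estimated on them targets the same quantity with a tighter interval.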

Python · dbt · BigQuery
