2026 · Solo project · 3 min read
UK Property Analytics Warehouse
End-to-end analytics engineering on 4.99M UK Land Registry property transactions. dbt + DuckDB warehouse with staging, intermediate, fact, dimension, and reporting layers, 88 data tests, CI, docs, and a Streamlit dashboard.
- SQL
- dbt
- DuckDB
- Streamlit
- GitHub Actions
- Python
What I built
An analytics-engineering project on 4.99M UK Land Registry property transactions: a dbt + DuckDB warehouse layered staging → intermediate → facts and dimensions → reporting marts, with 88 dbt tests, GitHub Actions CI, generated docs with full column-level lineage, and a Streamlit dashboard for exploration.
Why this matters
Most ML coursework lives in single-machine notebooks. This project covers the other half of every UK data role: writing SQL that scales, modelling data into facts and dimensions, testing every model, documenting every column, and shipping a dashboard a non-engineer can actually use.
Architecture
HM Land Registry CSV (4.99M rows)
│
▼
raw_landreg ──▶ staging ──▶ intermediate ──▶ dim_* / fct_* ──▶ rpt_*
│ │ │
▼ ▼ ▼
stg_landreg__ int_transactions fct_transactions
transactions __enriched dim_postcode
dim_property_type
dim_date
dim_tenure
Data layers
- Staging — type casting, renaming, light cleanup. One model per source table. No business logic.
- Intermediate — joined-and-enriched models that derived columns and reference data depend on.
- Marts (core) — dimensions (
dim_*) and facts (fct_*) modelled for clean analytical querying. Grain is documented on every mart. - Marts (analytics) — reporting models (
rpt_*) that pre-compute the common analytical questions for the dashboard.
Tests
88 dbt tests across three layers:
- Generic —
not_null,unique,accepted_values,relationshipson every staging and mart model. - Custom singular — business invariants ("no negative prices", "no future transactions", "every postcode resolves to a region").
dbt-expectations— distribution checks on key numeric columns.
What the dashboard answers
- Year-on-year price change by region.
- New-build premium over existing stock over time.
- Top postcodes by transaction volume.
- Property-type mix shift since 1995.
- Tenure split (freehold vs leasehold) by region.
Engineering practices
- CI — every PR runs
dbt buildagainst a small sample in under 90 seconds; sqlfluff lints the SQL. - Docs —
dbt docs generatepublished to GitHub Pages on every push tomain. - Reproducibility — fully buildable from a fresh clone in under five minutes; pinned package versions; no manual setup steps.
What I'd do differently
Add an SCD2 snapshot on postcode-region mapping (boundaries change occasionally). Migrate the warehouse from DuckDB to Snowflake or BigQuery once volume justifies it — the dbt models would port without change. Orchestrate the refresh with Dagster instead of a GitHub Actions cron, so I get asset-based modelling out of the box.