Skip to content
All projects

2026 · Solo project · 3 min read

UK Property Analytics Warehouse

End-to-end analytics engineering on 4.99M UK Land Registry property transactions. dbt + DuckDB warehouse with staging, intermediate, fact, dimension, and reporting layers, 88 data tests, CI, docs, and a Streamlit dashboard.

  • SQL
  • dbt
  • DuckDB
  • Streamlit
  • GitHub Actions
  • Python

What I built

An analytics-engineering project on 4.99M UK Land Registry property transactions: a dbt + DuckDB warehouse layered staging → intermediate → facts and dimensions → reporting marts, with 88 dbt tests, GitHub Actions CI, generated docs with full column-level lineage, and a Streamlit dashboard for exploration.

Why this matters

Most ML coursework lives in single-machine notebooks. This project covers the other half of every UK data role: writing SQL that scales, modelling data into facts and dimensions, testing every model, documenting every column, and shipping a dashboard a non-engineer can actually use.

Architecture

HM Land Registry CSV (4.99M rows)
        │
        ▼
   raw_landreg  ──▶  staging  ──▶  intermediate  ──▶  dim_* / fct_*  ──▶  rpt_*
                       │              │                   │
                       ▼              ▼                   ▼
                 stg_landreg__   int_transactions    fct_transactions
                  transactions    __enriched          dim_postcode
                                                      dim_property_type
                                                      dim_date
                                                      dim_tenure

Data layers

  • Staging — type casting, renaming, light cleanup. One model per source table. No business logic.
  • Intermediate — joined-and-enriched models that derived columns and reference data depend on.
  • Marts (core) — dimensions (dim_*) and facts (fct_*) modelled for clean analytical querying. Grain is documented on every mart.
  • Marts (analytics) — reporting models (rpt_*) that pre-compute the common analytical questions for the dashboard.

Tests

88 dbt tests across three layers:

  • Genericnot_null, unique, accepted_values, relationships on every staging and mart model.
  • Custom singular — business invariants ("no negative prices", "no future transactions", "every postcode resolves to a region").
  • dbt-expectations — distribution checks on key numeric columns.

What the dashboard answers

  • Year-on-year price change by region.
  • New-build premium over existing stock over time.
  • Top postcodes by transaction volume.
  • Property-type mix shift since 1995.
  • Tenure split (freehold vs leasehold) by region.

Engineering practices

  • CI — every PR runs dbt build against a small sample in under 90 seconds; sqlfluff lints the SQL.
  • Docsdbt docs generate published to GitHub Pages on every push to main.
  • Reproducibility — fully buildable from a fresh clone in under five minutes; pinned package versions; no manual setup steps.

What I'd do differently

Add an SCD2 snapshot on postcode-region mapping (boundaries change occasionally). Migrate the warehouse from DuckDB to Snowflake or BigQuery once volume justifies it — the dbt models would port without change. Orchestrate the refresh with Dagster instead of a GitHub Actions cron, so I get asset-based modelling out of the box.