UK Property Analytics Warehouse

What I built

An analytics-engineering project on 4.99M UK Land Registry property transactions: a dbt + DuckDB warehouse layered staging → intermediate → facts and dimensions → reporting marts, with 88 dbt tests, GitHub Actions CI, generated docs with full column-level lineage, and a Streamlit dashboard for exploration.

Why this matters

Most ML coursework lives in single-machine notebooks. This project covers the other half of every UK data role: writing SQL that scales, modelling data into facts and dimensions, testing every model, documenting every column, and shipping a dashboard a non-engineer can actually use.

Architecture

HM Land Registry CSV (4.99M rows)
        │
        ▼
   raw_landreg  ──▶  staging  ──▶  intermediate  ──▶  dim_* / fct_*  ──▶  rpt_*
                       │              │                   │
                       ▼              ▼                   ▼
                 stg_landreg__   int_transactions    fct_transactions
                  transactions    __enriched          dim_postcode
                                                      dim_property_type
                                                      dim_date
                                                      dim_tenure

Data layers

Staging — type casting, renaming, light cleanup. One model per source table. No business logic.
Intermediate — joined-and-enriched models that derived columns and reference data depend on.
Marts (core) — dimensions (dim_*) and facts (fct_*) modelled for clean analytical querying. Grain is documented on every mart.
Marts (analytics) — reporting models (rpt_*) that pre-compute the common analytical questions for the dashboard.

Tests

88 dbt tests across three layers:

Generic — not_null, unique, accepted_values, relationships on every staging and mart model.
Custom singular — business invariants ("no negative prices", "no future transactions", "every postcode resolves to a region").
dbt-expectations — distribution checks on key numeric columns.

What the dashboard answers

Year-on-year price change by region.
New-build premium over existing stock over time.
Top postcodes by transaction volume.
Property-type mix shift since 1995.
Tenure split (freehold vs leasehold) by region.

Engineering practices

CI — every PR runs dbt build against a small sample in under 90 seconds; sqlfluff lints the SQL.
Docs — dbt docs generate published to GitHub Pages on every push to main.
Reproducibility — fully buildable from a fresh clone in under five minutes; pinned package versions; no manual setup steps.

What I'd do differently

Add an SCD2 snapshot on postcode-region mapping (boundaries change occasionally). Migrate the warehouse from DuckDB to Snowflake or BigQuery once volume justifies it — the dbt models would port without change. Orchestrate the refresh with Dagster instead of a GitHub Actions cron, so I get asset-based modelling out of the box.