2026 · Solo project · 2 min read
Scalable ML with PySpark on Sheffield HPC
Distributed data-mining and ML pipelines over datasets of 1.9M to 20M records, run on the University of Sheffield Stanage HPC cluster: web-log mining, traffic prediction, HIGGS classification, and MovieLens recommendations.
- PySpark
- Python
- Slurm
- HPC
- Spark MLlib
What I built
Four end-to-end distributed ML pipelines on the Stanage HPC cluster, each exercising a different ML pattern at multi-million-record scale.
1. Web access log mining (1.9M records)
Parsed and aggregated NASA web logs to surface time-of-day access heatmaps and resource-popularity rankings. Wrote the transformations against PySpark DataFrames so the same code scales to logs orders of magnitude larger.
2. Traffic prediction (GLM on 5M+ records)
Built a Generalised Linear Model regression pipeline over a UK Highways traffic dataset with proper temporal train/test splits — predicting next-interval volume from rolling-window features. Evaluated against a baseline mean predictor for honest reporting.
3. HIGGS binary classification (11M records)
Compared Logistic Regression, Random Forest, and Gradient Boosted Trees on the HIGGS particle-physics dataset. Used cross-validation with hyperparameter tuning and ROC-AUC + PR-AUC for evaluation under class imbalance.
4. MovieLens recommendations (20M ratings)
Trained ALS collaborative filtering plus k-means user clustering to surface genre-preference profiles, evaluated on a held-out test set with RMSE and top-N retrieval metrics.
Why this matters
Most ML coursework lives in single-machine notebooks. This project pushed the same modelling discipline onto a real distributed-compute environment: Slurm job scheduling, partition tuning, memory-aware joins, and queries written to survive when the data doesn't fit on one node.
What I'd do differently
Rebuild the traffic prediction step with a proper time-series cross-validator rather than a single split, and explore Spark Structured Streaming for the web-log mining pipeline so it could ingest live logs.