Case study / 2023

Steam Games Recommender

Collaborative-filtering recommender built on PySpark with MLflow tracking, shipped on Databricks.

[Image: Steam Games Recommender thumbnail with three-node network graph]

The full ML lifecycle, at scale.

Building a recommender system at scale is a different problem from building one on a laptop. The Steam 200k dataset has implicit feedback from real users at real volume, and using it well means moving the modelling onto a distributed engine and tracking experiments properly. This was the project where I shipped a full ML lifecycle: distributed ALS training, hyperparameter sweeps with cross-validation, and experiment tracking in MLflow.

ALS on Spark MLlib, with proper tuning.

Alternating Least Squares is the right algorithm for implicit feedback at scale, and Spark MLlib has a clean implementation. I set up the pipeline on Databricks Community Edition, which gave me enough compute to run a proper grid search rather than a token tuning step.

The parameter space covered rank, regParam, alpha, and maxIter. I wrapped the whole thing in CrossValidator, tracked every experiment in MLflow, and used the best-performing model to generate personalised recommendations for sample users.

Outcome.

A working recommender, a clean codebase, a reproducible pipeline, and a habit of never skipping hyperparameter tuning just because it is tedious.

→ GitHub repository