Case study / 2022 to 2023

Big Data Player Scouting

A PySpark pipeline that ranks footballers across five European leagues using public event-level match data. My MSc Big Data Analytics thesis, published and open-sourced.

Scouting is a data science problem.

Modern football clubs scout from a population of tens of thousands of players across leagues, ages, and positions. Recruitment comes down to two tasks: ranking players against each other on dimensions that actually predict on-pitch contribution, and surfacing candidates who match the playing style a club needs. My MSc thesis tested how big-data analytics could support that decision.

Goals and assists are lagging indicators that favour attackers and ignore off-ball work. Coaches know this, but the infrastructure to do better at scale has been slow to arrive outside the biggest clubs. My thesis question: given event-level data across multiple leagues, can we rank and recommend players by their signature of play rather than by raw output?

What I did with 500+ players and 100+ matches.

The data came from Pappalardo et al. (2019), a public event-level dataset covering the 2017/18 season of La Liga, Serie A, the Premier League, the Bundesliga, and Ligue 1, plus the 2018 World Cup and Euro 2016. I loaded the raw events into PySpark to handle the volume cleanly, then engineered KPIs across five dimensions: ball progression, defensive contribution, attacking threat, discipline, and off-ball movement signals.
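To make the KPI step concrete, here is a minimal plain-Python sketch of the aggregation shape: count events per player, then normalise by minutes played. It is an illustration, not the thesis code. The event types, the `EVENT_TO_KPI` mapping, the `kpis_per_90` helper, and the per-90 normalisation are all assumptions made for the example, and the real dataset's schema is richer than these toy records. In the actual pipeline the same grouping would run as a PySpark aggregation rather than a Python loop.

```python
from collections import defaultdict

# Hypothetical mapping from a simplified event type to the KPI it feeds.
# The real dataset uses its own event taxonomy; these names are illustrative.
EVENT_TO_KPI = {
    "progressive_pass": "ball_progression",
    "tackle": "defensive_contribution",
    "interception": "defensive_contribution",
    "shot": "attacking_threat",
    "foul": "discipline",
}


def kpis_per_90(events, minutes_played):
    """Count KPI events per player and normalise to a per-90-minute rate."""
    counts = defaultdict(lambda: defaultdict(int))
    for event in events:
        kpi = EVENT_TO_KPI.get(event["type"])
        if kpi is not None:  # unmapped event types are ignored
            counts[event["player"]][kpi] += 1

    rates = {}
    for player, kpi_counts in counts.items():
        minutes = minutes_played.get(player, 0)
        if minutes <= 0:
            continue  # skip players with no recorded minutes
        rates[player] = {k: c * 90 / minutes for k, c in kpi_counts.items()}
    return rates


# Toy input: two players, six events.
events = [
    {"player": "A", "type": "progressive_pass"},
    {"player": "A", "type": "progressive_pass"},
    {"player": "A", "type": "tackle"},
    {"player": "B", "type": "shot"},
    {"player": "B", "type": "foul"},
    {"player": "B", "type": "throw_in"},  # not mapped to any KPI
]
minutes_played = {"A": 180, "B": 90}

rates = kpis_per_90(events, minutes_played)
```

Normalising by minutes keeps a regular starter and a frequent substitute comparable, which matters once the rates feed a ranking.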

I layered a ranking model on top that weights the KPIs by position and style context, then built a recommendation layer that surfaces players whose style signatures resemble a reference player's. That last part is where the research earns its weight. Scouts care less about who the highest-rated player is and more about which players offer a similar profile at a transfer fee the club can justify.
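The two layers can be sketched in a few lines of plain Python. Treat this as a sketch under stated assumptions rather than the thesis implementation: the weights in `POSITION_WEIGHTS` are invented, the KPI values are toy numbers assumed to be scaled to a common range, and cosine similarity is my choice for the example; the source does not say which similarity measure the recommendation layer uses.

```python
import math

# Hypothetical position weights; the thesis weights are not reproduced here.
POSITION_WEIGHTS = {
    "defender": {"ball_progression": 0.3, "defensive_contribution": 0.5, "attacking_threat": 0.2},
    "forward": {"ball_progression": 0.2, "defensive_contribution": 0.1, "attacking_threat": 0.7},
}


def weighted_score(kpis, position):
    """Weighted sum of a player's KPIs using the weights for their position."""
    weights = POSITION_WEIGHTS[position]
    return sum(w * kpis.get(k, 0.0) for k, w in weights.items())


def cosine_similarity(a, b):
    """Cosine of the angle between two KPI vectors stored as dicts."""
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in keys)
    norm_a = math.sqrt(sum(a.get(k, 0.0) ** 2 for k in keys))
    norm_b = math.sqrt(sum(b.get(k, 0.0) ** 2 for k in keys))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def similar_players(reference, players, top_n=3):
    """Rank every other player by similarity to the reference player's KPIs."""
    ref_kpis = players[reference]
    scored = [
        (name, cosine_similarity(ref_kpis, kpis))
        for name, kpis in players.items()
        if name != reference
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy KPI vectors, assumed already scaled to a common range.
players = {
    "ref": {"ball_progression": 0.8, "defensive_contribution": 0.2, "attacking_threat": 0.6},
    "p1": {"ball_progression": 0.4, "defensive_contribution": 0.1, "attacking_threat": 0.3},
    "p2": {"ball_progression": 0.1, "defensive_contribution": 0.9, "attacking_threat": 0.1},
}

matches = similar_players("ref", players)
forward_score = weighted_score(players["ref"], "forward")
```

In this toy data `p1` has the same KPI proportions as `ref` at half the volume, so it ranks first. Cosine similarity compares the shape of a profile rather than its magnitude, which suits the idea of matching a style signature instead of raw output.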

What came out of it.

A working pipeline, a thesis published in the Turkish National Thesis Centre, and a public GitHub repo. The thesis also reports several previously unremarked correlations between player actions and team style of play, which I walk through in the text.

→ GitHub repository

→ Turkish National Thesis Centre