Over the course of this season I’ve been occasionally running simulations of how the rest of the Major League Soccer season might end up. The feedback I’ve gotten from this work has been generally positive, and a number of people have asked about my methodology. This post is, finally, my attempt to explain the process I’ve been following.
The TL;DR version is this: I’m running a Monte Carlo simulation that randomly assigns a result to each remaining league game.
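As a rough illustration of that idea, here is a minimal sketch of such a simulation. The team names, points totals, fixture list, and outcome weights are all placeholders I have assumed for the example, not real standings and not necessarily the probabilities used in the actual simulations.

```python
import random
from collections import Counter

# Hypothetical inputs: placeholder teams, current points, and remaining games.
points = {"Team A": 30, "Team B": 28, "Team C": 27}
remaining = [("Team A", "Team B"), ("Team B", "Team C"), ("Team C", "Team A")]

# Assumed chances of a home win, draw, and away win (illustrative only).
OUTCOMES = ("home", "draw", "away")
OUTCOME_WEIGHTS = (0.5, 0.25, 0.25)


def simulate_season(points, remaining, rng):
    """Randomly assign a result to every remaining game; return final points."""
    table = dict(points)  # copy so the real standings are not modified
    for home, away in remaining:
        outcome = rng.choices(OUTCOMES, weights=OUTCOME_WEIGHTS)[0]
        if outcome == "home":
            table[home] += 3
        elif outcome == "away":
            table[away] += 3
        else:
            table[home] += 1
            table[away] += 1
    return table


def first_place_odds(points, remaining, runs=10_000, seed=0):
    """Share of simulated seasons in which each team finishes top on points."""
    rng = random.Random(seed)
    finishes = Counter()
    for _ in range(runs):
        table = simulate_season(points, remaining, rng)
        # Ties on points go to the team listed first; real tiebreakers are ignored.
        finishes[max(table, key=table.get)] += 1
    return {team: finishes[team] / runs for team in points}
```

Calling `first_place_odds(points, remaining)` returns each team's share of first-place finishes across the simulated seasons.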
One of my goals for this season is to expand the scope of my analysis efforts. For the last few seasons I’ve been able to capture not just Columbus data, but all of MLS. For 2016, my hope has been to expand that further and capture leagues like the NASL and USL. These efforts have been hindered, however, by a lack of consistency across sources – at least in the USL (the NASL begins play this weekend).
To illustrate the problem, here are three screenshots from the first game of the USL season, Bethlehem Steel FC at FC Montreal:
The box score in this gallery was originally worse: when first published, the data for this game was even more inconsistent. Credit to the USL for attempting to fix the problem, but as we will see, problems remain.
Poor box score design
The first and most fundamental problem with the box score is that starters are not identified. Without this information, it is impossible to conduct any analysis based on in-game events because one cannot determine who was on the field at a given moment. I’ve color-coded the apparent substitution pairs in the image above, working backwards from the fact that the game lasted 90 minutes. One such pair is Fabio Morelli (55 minutes played) and Yacine Ait-Slimane (35 minutes). Yet who replaced whom? Did Yacine suffer a first-half injury and need to be replaced? Or did Fabio last until ten minutes into the second half?
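The pairing logic can be sketched in a few lines. This is only an illustration under the assumption that a substitution pair is any two players whose minutes add up to 90; the function names are mine, and the minutes are the ones quoted in this post.

```python
from itertools import combinations

MATCH_LENGTH = 90  # assumes no stoppage time is credited in the box score

# Minutes played, as listed in the league box score discussed in this post.
minutes = {
    "Fabio Morelli": 55,
    "Yacine Ait-Slimane": 35,
    "Mastanabal Kacher": 66,
    "Heikel Jarras": 24,
}


def apparent_sub_pairs(minutes, match_length=MATCH_LENGTH):
    """Pair up players whose minutes sum to the full match length."""
    partial = {name: m for name, m in minutes.items() if m < match_length}
    return [
        (a, b)
        for a, b in combinations(partial, 2)
        if partial[a] + partial[b] == match_length
    ]


def possible_scenarios(a, b, minutes):
    """Without starter flags, each pair can be read in two ways."""
    return [
        {"starter": a, "substitute": b, "swap_minute": minutes[a]},
        {"starter": b, "substitute": a, "swap_minute": minutes[b]},
    ]


for a, b in apparent_sub_pairs(minutes):
    print(a, "/", b, "->", possible_scenarios(a, b, minutes))
```

The sketch finds the pairs but cannot choose between the two scenarios for each one, which is exactly the information a starter flag would provide.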
The goal in this game is noted as coming in the 43rd minute. One of the standard investigations I perform is similar to the plus/minus figure in hockey – but that isn’t possible without knowing who was on the field for a given goal.
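For readers unfamiliar with the idea, here is a minimal sketch of a plus/minus calculation. Everything in it is hypothetical: the players, their stints, and which side scored are placeholders, and the convention that a player subbed off in minute N is still charged with a goal in minute N is my own assumption. The point is that the calculation needs on/off minutes for every player, which is what the box score does not supply.

```python
# Hypothetical on-field stints: (player, team, minute_on, minute_off).
stints = [
    ("Starter A", "home", 0, 60),
    ("Substitute B", "home", 60, 90),
    ("Starter C", "away", 0, 90),
]

# Goals as (minute, scoring_team). The 43rd minute matches the game above;
# the scoring side is a placeholder for the example.
goals = [(43, "away")]


def plus_minus(stints, goals):
    """+1 for each goal scored by a player's team while he was on the field,
    -1 for each goal conceded while he was on the field."""
    totals = {player: 0 for player, _team, _on, _off in stints}
    for player, team, on, off in stints:
        for minute, scoring_team in goals:
            if on < minute <= off:  # assumed convention for boundary minutes
                totals[player] += 1 if scoring_team == team else -1
    return totals


print(plus_minus(stints, goals))
```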
Inconsistent game data
Thankfully, in at least some cases the individual teams are including a more traditional lineup notation in their game reports. Not every team does this, however (Seattle, for one, does not), so it isn’t a reliable data source. In the case of Bethlehem – Montreal, both teams included game data blocks in their match reports, but this only furthers the confusion.
Comparing the league box score with the teams’ game data reveals new problems. The box score appears to indicate that Mastanabal Kacher (66 minutes) swapped places with Heikel Jarras (24 minutes). Yet both teams claim that this wasn’t the case, and that Kacher was instead replaced by Charles Joly in the 67th minute.
This problem is more noticeable in the case of Jacques Haman. The league claims he played 67 minutes and swapped with Charles Joly, but both teams claim he played 71 minutes before leaving for Heikel Jarras.
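This kind of cross-check is easy to automate once both sources are in a common shape. The sketch below assumes a simple dictionary layout of my own invention; the values are the ones quoted above. The teams’ reports give Kacher’s substitution minute rather than his minutes played, so that field is left empty rather than guessed.

```python
# League box score, with the replacement inferred from minutes summing to 90.
league = {
    "Mastanabal Kacher": {"minutes": 66, "replaced_by": "Heikel Jarras"},
    "Jacques Haman": {"minutes": 67, "replaced_by": "Charles Joly"},
}

# The two teams' game data blocks, which agree with each other on these points.
teams = {
    "Mastanabal Kacher": {"minutes": None, "replaced_by": "Charles Joly"},
    "Jacques Haman": {"minutes": 71, "replaced_by": "Heikel Jarras"},
}


def find_discrepancies(source_a, source_b):
    """List every (player, field, value_a, value_b) where two sources disagree."""
    issues = []
    for player in sorted(source_a.keys() & source_b.keys()):
        shared_fields = source_a[player].keys() & source_b[player].keys()
        for field in sorted(shared_fields):
            a, b = source_a[player][field], source_b[player][field]
            if a is None or b is None:
                continue  # one source is silent, so there is nothing to compare
            if a != b:
                issues.append((player, field, a, b))
    return issues


for issue in find_discrepancies(league, teams):
    print(issue)
```

A check like this only flags that the sources disagree; it cannot say which one is right.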
A further wrinkle is the mismatches between the two teams’ game data. Did James Chambers’ yellow card come in the 75th minute or the 74th? By what name should we refer to the Montreal midfielder: Louis Beland, or Louis Beland-Goyette?
A third problem, although one that doesn’t affect game analysis, is the lack of published attendance figures. No league announcement appears to include them. Of a handful of team stories I’ve inspected, only Arizona seems to have released any figure. My hope is that Kenn Tomasch will be able to uncover this data, if the USL decides not to release it.
Scouting the “path to MLS”
The USL is not the only league to suffer these sorts of problems. I’ve come across inconsistent figures in MLS data, although this is usually restricted to third-party data sources. At least the MLS box score design avoids the most egregious of these problems: it identifies the starting lineup and is internally consistent with other published reports. The game data blocks at the bottom of some game reports are not something an analyst can use at scale, because they are essentially terse free text typed into a text editor rather than drawn from a central data source.
I would really like to include the USL in my analysis efforts this season. The extensive relationships between MLS and USL teams mean that players’ performance in that league offers a path up to the first division. Without a reliable way to retrieve data about what happens in these games, however, it is hard to understand how players are progressing along this “path to MLS”.