Website design, data consistency, and how analysis is hampered

One of my goals for this season is to expand the scope of my analysis efforts. For the last few seasons I’ve been able to capture not just Columbus data, but all of MLS. For 2016, my hope has been to expand that further and capture leagues like the NASL and USL. These efforts have been hindered, however, by a lack of consistency across sources – at least in the USL (the NASL begins play this weekend).

To illustrate the problem, here are three screenshots for the first game of the USL season: Bethlehem Steel FC at FC Montreal:

Links to the original source articles are:

It should be noted that the box score in this gallery was originally worse. When first published, data for this game was even more inconsistent. Credit to the USL for attempting to fix the problem, but as we will see there are still problems.

Poor box score design

The first and most fundamental problem with the box score is that starters are not identified. Without this information, it is impossible to conduct any analysis based on in-game events because one cannot determine who was on the field at a given moment. I’ve color-coded the apparent substitution pairs in the image above, working backwards from the fact that the game lasted 90 minutes. One such pair is Fabio Morelli (55 minutes played) and Yacine Ait-Slimane (35 minutes). Yet who replaced whom? Did Yacine suffer a first-half injury and need to be replaced? Or did Fabio last until ten minutes into the second half?
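The "working backwards" step can be sketched in a few lines of Python. This is only an illustration, not the tooling used on this site: the function name `candidate_sub_pairs` is made up for the example, and the only data in it are the minutes-played figures quoted in this post. It pairs up players whose minutes add up to a full 90.

```python
from itertools import combinations

MATCH_LENGTH = 90  # assumption: no stoppage time is credited in the box score


def candidate_sub_pairs(minutes, match_length=MATCH_LENGTH):
    """Return pairs of players whose minutes played sum to a full match.

    `minutes` maps player name -> minutes played. Players with a full
    match (or zero minutes) cannot be part of a substitution pair.
    """
    partial = {name: m for name, m in minutes.items() if 0 < m < match_length}
    return [
        (a, b)
        for a, b in combinations(sorted(partial), 2)
        if partial[a] + partial[b] == match_length
    ]


# Minutes played as quoted from the league box score in this post.
minutes = {
    "Fabio Morelli": 55,
    "Yacine Ait-Slimane": 35,
    "Mastanabal Kacher": 66,
    "Heikel Jarras": 24,
}

pairs = candidate_sub_pairs(minutes)
for a, b in pairs:
    print(f"{a} ({minutes[a]}) + {b} ({minutes[b]}) = {MATCH_LENGTH}")
```

Note what the output cannot say: each pair is unordered, so the code has no way to tell which player started. It also breaks down if two different pairs happen to share the same split of minutes.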

The goal in this game is noted as coming in the 43rd minute. One of the standard investigations I perform is similar to the plus/minus figure in hockey – but that isn’t possible without knowing who was on the field for a given goal.
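To show why the starting lineup matters, here is a minimal sketch of a plus/minus calculation. It assumes each player's time on the field is already known as a start and end minute, which is exactly what the box score fails to provide. The `Stint` class, the `plus_minus` function, and the players and teams in the example are all hypothetical placeholders; only the 43rd-minute goal comes from the game discussed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stint:
    player: str
    team: str
    start: int  # minute the player entered (0 for a starter)
    end: int    # minute the player left (match length if never replaced)


def plus_minus(stints, goals):
    """Credit +1 / -1 to every player on the field when a goal is scored.

    `goals` is a list of (minute, scoring_team) tuples. A player is treated
    as on the field for minutes start+1 through end, inclusive.
    """
    totals = {s.player: 0 for s in stints}
    for minute, scoring_team in goals:
        for s in stints:
            if s.start < minute <= s.end:
                totals[s.player] += 1 if s.team == scoring_team else -1
    return totals


# Hypothetical stints; the real ones cannot be read off the box score.
stints = [
    Stint("Starter A", "Home", 0, 55),
    Stint("Substitute A", "Home", 55, 90),
    Stint("Starter B", "Away", 0, 90),
]
goals = [(43, "Away")]  # hypothetical scoring side for a 43rd-minute goal

totals = plus_minus(stints, goals)
print(totals)
```

Everything in `stints` depends on knowing who started and when each change happened. With only minutes played, the `start` and `end` fields have to be guessed.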

Inconsistent game data

Thankfully, in at least some cases the individual teams are including a more traditional lineup notation in their game reports. Not every team does this, however (Seattle, for one, does not), so it isn’t a reliable data source. In the case of Bethlehem – Montreal, both teams included game data blocks in their match reports, but this only adds to the confusion.

Comparing the league box score with the teams’ game data, new problems are revealed. The box score appears to indicate that Mastanabal Kacher (66 minutes) swapped places with Heikel Jarras (24 minutes). Yet both teams claim that this wasn’t the case, and that Kacher was instead replaced by Charles Joly in the 67th minute.

This problem is more noticeable in the case of Jacques Haman. The league claims he played 67 minutes and swapped with Charles Joly, but both teams claim he played 71 minutes before leaving for Heikel Jarras.
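A cross-check like this is easy to automate once each source has been transcribed. The sketch below is an assumption about how one might do it, not a description of any existing tool: `diff_substitutions` is a made-up helper, and the records are typed in by hand from the figures quoted above. The league entries are inferred from minutes played, while the team entries come from the game data blocks, so the minute values are not strictly the same measurement.

```python
def diff_substitutions(league, teams):
    """Return every departing player whose record differs between sources.

    Each argument maps the player who left the field to a tuple of
    (replacement, minute figure as that source reports it).
    """
    mismatches = {}
    for player in sorted(league.keys() | teams.keys()):
        a, b = league.get(player), teams.get(player)
        if a != b:
            mismatches[player] = {"league": a, "teams": b}
    return mismatches


# Inferred from the league box score (minutes played).
league = {
    "Mastanabal Kacher": ("Heikel Jarras", 66),
    "Jacques Haman": ("Charles Joly", 67),
}
# As reported in both teams' game data blocks.
teams = {
    "Mastanabal Kacher": ("Charles Joly", 67),
    "Jacques Haman": ("Heikel Jarras", 71),
}

mismatches = diff_substitutions(league, teams)
for player, sources in mismatches.items():
    print(player, sources)
```

Both departing players are flagged, and in both cases the sources disagree about the replacement, not just the minute.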

A further wrinkle is the mismatches between the two teams’ game data. Did James Chambers’ yellow card come in the 75th minute or the 74th? By what name should we refer to the Montreal midfielder: Louis Beland, or Louis Beland-Goyette?


A third problem, although one that doesn’t affect game analysis, is the lack of attendance figures in any published source. No league announcement appears to include this information. Of a handful of team stories I’ve inspected, only Arizona seems to have released any figure. My hope is that Kenn Tomasch will be able to uncover this data, if the USL decides not to release it.

Scouting the “path to MLS”

The USL is not the only league to suffer these sorts of problems. I’ve come across inconsistent figures in MLS data, although this is usually restricted to third-party data sources. At least the MLS box score design avoids the most egregious of these problems: it identifies the starting lineup, and it is internally consistent with other published reports. The game data blocks at the bottom of some game reports are not something an analyst can use at scale, because they are basically just very terse text typed by hand in a text editor, not drawn from a central data source.

I would really like to include the USL in my analysis efforts this season. The extensive relationships between MLS and USL teams mean that players’ performance in that league offers a path up to the first division. Without a reliable way to retrieve data about what happens in these games, however, it is hard to understand how players are progressing along this “path to MLS”.

The bounds of creativity


Sometimes, creativity needs to be bounded. This is a lesson I’ve been learning (or perhaps re-learning) lately, and it can be seen specifically in the display of players on this site.

Since MRData launched, and even through today, the list of all-time players has been rendered using a platform based on Isotope, which renders items in a grid:

Player roster as grid

This is very clever, but as time wore on, it became clear that “players as elements, rosters as a periodic table” isn’t a very useful metaphor. One of my goals for this year is to improve on this design, and the changes so far are looking awfully familiar:
Player roster as table

Sometimes, it isn’t necessary to completely rethink a display. Sometimes, the traditional method of displaying a set of information (in this case, a table) is generally okay, and only simple tweaks are needed (in this case, depicting a player’s career using a visual timeline).

Looking At Player Sizes

A week or so ago a friend of mine posted this chart of NHL player sizes, and it struck a chord with me. So, I tweaked it a bit, and turned it into an interactive tool showing MLS players.

Clicking the image above will show you the Crew’s roster at full scale, while the complete tool can be found here. More information about this can be found after the jump.


Exploring Relationships Between Players and Games

Player-Game graph for 1996 Crew season, highlighting Brad Friedel

Team rosters evolve over time. Between seasons there can be significant turnover, but on occasion a team will also undergo dramatic changes during the course of a season. For teams with a large roster, more subtle variations can emerge: squad rotations, and players who appear in one competition but not another.


Visualizing Players … as books?

I had some time this morning as I waited to leave for work, so I put together a quick test of StackView, a tool developed at Harvard to allow for easier browsing of library collections. I’m still working out some kinks, but this may end up replacing the current card-based display for players on this site:

Soccer players visualized as books
