Over the weekend I described the process I’m using to simulate MLS seasons, and the odds that a team reaches the playoffs. After using this tool for most of this year, I’ve recently tried extending this work to look at how team projections change during a season.
Over the course of this season I’ve been occasionally running simulations of how the rest of the Major League Soccer season might end up. The feedback I’ve gotten from this work has been generally positive, and a number of people have asked about my methodology. This post is, finally, my attempt to explain the process I’ve been following.
The TL;DR version is this: I’m running a Monte Carlo simulation that randomly assigns a result to each remaining league game.
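In spirit, that approach can be sketched in a few lines of Python (my published code is in R; the teams, fixtures, and result probabilities below are hypothetical stand-ins, not actual MLS data):

```python
import random

# Hypothetical current point totals and remaining fixtures.
standings = {"Columbus": 40, "Toronto": 38, "Chicago": 35}
fixtures = [("Columbus", "Toronto"), ("Chicago", "Columbus"), ("Toronto", "Chicago")]

def simulate_season(standings, fixtures, p_home_win=0.5, p_draw=0.25, seed=None):
    """Randomly assign a result to each remaining game and return final points."""
    rng = random.Random(seed)
    points = dict(standings)
    for home, away in fixtures:
        roll = rng.random()
        if roll < p_home_win:              # home win: 3 points to the home side
            points[home] += 3
        elif roll < p_home_win + p_draw:   # draw: 1 point each
            points[home] += 1
            points[away] += 1
        else:                              # away win: 3 points to the away side
            points[away] += 3
    return points

# Repeat many times and count how often each team tops this mini-table.
trials = 10_000
tops = {team: 0 for team in standings}
for i in range(trials):
    final = simulate_season(standings, fixtures, seed=i)
    leader = max(final, key=final.get)
    tops[leader] += 1
odds = {team: wins / trials for team, wins in tops.items()}
```

The real simulation does the same thing at league scale, tallying how often each team lands in a playoff position rather than first place.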
I recently shared on Twitter an updated visualization that summarizes the outcome of all the expansion drafts in MLS history. Particularly now that the league has 20+ years of history, and has more than doubled in size since its inception, it seems worth preserving the history of events like the expansion draft.
While Toronto and Seattle are preparing to face off in MLS Cup, the rest of the teams in Major League Soccer are turning their attention towards next season. Most immediately, teams are making decisions about which players should return and which should be let go.
Given the emphasis that Columbus coach Gregg Berhalter places on possession and passing, it probably isn’t a surprise that I’ve started to focus on those statistical categories. Throughout the 2016 season, as troublesome results started to accumulate, I’ve tried to understand the Columbus approach to possession and passing within the context of other teams in MLS.
Now including all 2016 MLS games. Sorted box-and-whisker plots of team possession %. pic.twitter.com/VFu6jU3WL7
— Matt Bernhardt (@bernhardtsoccer) August 11, 2016
— Matt Bernhardt (@bernhardtsoccer) September 10, 2016
I’m hardly alone in this, of course – which is one of my motivations in writing this post.
Now that the 2016 MLS league season has concluded, I combed through the stats pages for each game and recorded a series of data points. The resulting dataset has been posted to GitHub.
Each observation is a single team’s performance in a single game – so with 20 teams each playing 34 games, there are 680 rows available. Fields include:
- Possession %
- Pass completion %
- Passes in specific areas of the field (attacking half, final third, and crosses)
- Pass completion % in those areas
- Shots on target
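To give a sense of the one-row-per-team-per-game shape, here is a short Python sketch of how such a file can be summarized (the column names and sample rows are hypothetical; the real headers are in the GitHub repository):

```python
import csv
import io
from statistics import median

# Hypothetical rows in the one-team-per-game shape described above.
sample = """team,opponent,possession_pct,pass_pct,shots_on_target
Columbus,Chicago,58.2,81.0,5
Chicago,Columbus,41.8,74.5,3
Columbus,Toronto,52.6,79.3,4
Toronto,Columbus,47.4,77.1,6
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Median possession per team -- the statistic behind the sorted
# box-and-whisker plots shared on Twitter during the season.
by_team = {}
for row in rows:
    by_team.setdefault(row["team"], []).append(float(row["possession_pct"]))
medians = {team: median(values) for team, values in by_team.items()}
```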
Hopefully the work to assemble this data proves useful to someone. I’ve been using this data for many of the plots that I’ve shared on Twitter this season, and now that Columbus is done for the year I’ve been exploring it in more detail.
I hope to be able to share what I’ve found over the coming weeks, but for now I mostly just want to see whether anyone is interested in the data itself.
Here are some sample plots that I’ve been working with that are generated by this data. More information will be shared in future posts.
Every so often I go back and refresh this chart – one of the first I ever tried – that plots players used in World Cup cycles (both qualifying and tournament finals) by the US men’s national team from 1998 through the present. With the 2018 Hexagonal set to begin next month, this seems like a good time for the latest iteration.
More details after the jump.
The 2016 season for Columbus Crew SC is turning into one of the hardest in team history. A year after hosting MLS Cup, the team is mired near the foot of the Eastern Conference standings. The collapse is threatening to set several team records for futility, including lowest number of victories and the fewest points earned in a season.
Recently, I was asked via Twitter to look into another possible mark of futility:
— Flick (@mikeflick) August 5, 2016
The answer, sadly, is that this year’s team will at least equal the longest winless stretch in history. Should they fail to defeat New York City FC (currently leading the Eastern Conference) on August 13th, they will own the mark outright.
The following table illustrates the longest winless streaks (solely looking at league play) during each calendar year.
| Season | Max Gap | Last Win | Next Win | Notes |
| --- | --- | --- | --- | --- |
| 1996 | 75 | May 11 | July 25 | |
| 1997 | 41 | May 11 | June 21 | |
| 1998 | 30 | July 9 | August 8 | |
| 1999 | 42 | May 15 | June 26 | |
| 2000 | 28 | May 27 | June 24 | |
| 2001 | 35 | April 14 | May 19 | |
| 2002 | 31 | March 27 | April 27 | |
| 2003 | 56 | June 28 | August 23 | |
| 2004 | 27 | June 6 | July 3 | Columbus did not win its first game until 42 days into the 2004 season. |
| 2005 | 39 | June 11 | July 20 | |
| 2006 | 77 | June 3 | August 19 | |
| 2007 | 62 | July 22 | September 22 | |
| 2008 | 35 | May 10 | June 14 | |
| 2009 | 29 | August 15 | September 13 | Columbus did not win its first game until 49 days into the 2009 season. |
| 2010 | 50 | September 4 | October 24 | The 2010 team’s 50-day winless streak ended when the season ended. |
| 2011 | 43 | August 20 | October 2 | |
| 2012 | 42 | March 31 | May 12 | |
| 2013 | 35 | March 23 | April 27 | |
| 2014 | 56 | March 29 | May 24 | 2014 had two different spells of 56 days between victories. |
| 2014 | 56 | May 24 | July 19 | |
| 2015 | 46 | May 9 | June 24 | |
| 2016 | 77 | May 28 | August 13 | This assumes Columbus defeats New York City FC on August 13. |
Before 2016, the club record for days between victories in a single season was 2006, when Sigi Schmid’s inaugural squad went 77 days between victories on June 3 and August 19.
The 2006 season is commonly acknowledged as the worst in team history. The first of Schmid’s three years in charge saw so much player turnover from the Greg Andrulis years that Columbus was essentially an expansion club, and almost none of the players from that year were on the roster to lift MLS Cup two years later.
Interestingly, the season with the shortest winless streak – including season-beginning streaks – was 2000. This is ironic, given that the 2000 team was also the first to miss the playoffs. While their 11 victories were spread out evenly enough that they avoided any long doldrums, they also won no more than two games in a row during the season.
Other seasons of note in this summary were 2004 and 2009, which began with winless streaks (42 and 49 days respectively) that were longer than any seen during the middle of the season. Oddly, both of those years ended with Columbus winning the Supporters’ Shield. The 2004 team is somewhat infamous for reaching this mark on the strength of only five losses, but it drew more often (13 times) than it won (12).
I should give a caveat that my records for 1996 – 1999 treat all games that went to a shootout as ties. Arguments can be made that this is inaccurate, but in the case of this particular question the difference is irrelevant. In 1996 Columbus did not win a game of soccer between May 11 and July 25 (a stretch of 75 days) – even though they twice won the post-game shootout during that spell.
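The “Max Gap” column above is simply the largest number of days between consecutive league wins. A Python sketch of that calculation (the function is my reconstruction; the two dates are the 1996 endpoints from the table):

```python
from datetime import date

def max_winless_gap(win_dates):
    """Largest number of days between consecutive wins in a season."""
    ordered = sorted(win_dates)
    return max((b - a).days for a, b in zip(ordered, ordered[1:]))

# 1996: no league win between May 11 and July 25, a 75-day stretch.
wins_1996 = [date(1996, 5, 11), date(1996, 7, 25)]
```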
One of my goals for this season is to expand the scope of my analysis efforts. For the last few seasons I’ve been able to capture not just Columbus data, but all of MLS. For 2016, my hope has been to expand that further and capture leagues like the NASL and USL. These efforts have been hindered, however, by a lack of consistency across sources – at least in the USL (the NASL begins play this weekend).
To illustrate the problem, here are three screenshots for the first game of the USL season: Bethlehem Steel FC at FC Montreal:
Links to the original source articles are:
It should be noted that the box score in this gallery was originally worse. When first published, data for this game was even more inconsistent. Credit to the USL for attempting to fix the problem, but as we will see there are still problems.
Poor box score design
The first and most fundamental problem with the box score is that starters are not identified. Without this information, it is impossible to conduct any analysis based on in-game events because one cannot determine who was on the field at a given moment. I’ve color-coded the apparent substitution pairs in the image above, working backwards from the fact that the game lasted 90 minutes. One such pair is Fabio Morelli (55 minutes played) and Yacine Ait-Slimane (35 minutes). Yet who replaced whom? Did Yacine suffer a first-half injury and need to be replaced? Or did Fabio last until ten minutes into the second half?
The goal in this game is noted as coming in the 43rd minute. One of the standard investigations I perform is similar to the plus/minus figure in hockey – but that isn’t possible without knowing who was on the field for a given goal.
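To make the problem concrete: a plus/minus figure needs nothing more than each player’s on-field interval and the goal minutes. A Python sketch (the 80th-minute goal here is hypothetical, chosen to show how the answer flips with the missing starter information):

```python
def plus_minus(on, off, goals_for, goals_against):
    """Goals scored minus goals conceded while a player was on the field.

    `on`/`off` are the minutes the player entered and left the game;
    `goals_for`/`goals_against` are lists of goal minutes."""
    scored = sum(1 for m in goals_for if on <= m <= off)
    conceded = sum(1 for m in goals_against if on <= m <= off)
    return scored - conceded

# Morelli played 55 minutes, but the box score does not say *which* 55.
# For a hypothetical 80th-minute goal, his credit depends entirely on
# whether he started or came on as a substitute:
if_started = plus_minus(0, 55, goals_for=[80], goals_against=[])   # off before the goal
if_sub = plus_minus(35, 90, goals_for=[80], goals_against=[])      # on for the goal
```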
Inconsistent game data
Thankfully, in at least some cases the individual teams include a more traditional lineup notation in their game reports – although not every team does this (Seattle, for example, does not), so it isn’t a reliable data source. In the case of Bethlehem – Montreal, both teams included game data blocks in their match reports, but this only furthers the confusion.
Comparing the league box score with the teams’ game data, new problems are revealed. The box score appears to indicate that Mastanabal Kacher (66 minutes) swapped places with Heikel Jarras (24 minutes). Yet both teams claim that this wasn’t the case, and that Kacher was instead replaced by Charles Joly in the 67th minute.
This problem is more noticeable in the case of Jacques Haman, whom the league credits with 67 minutes – swapping with Charles Joly – but whom both teams report played 71 minutes before leaving for Heikel Jarras.
A further wrinkle is the set of mismatches between the two teams’ game data. Did James Chambers’ yellow card come in the 75th minute or the 74th? By what name should we refer to the Montreal midfielder: Louis Beland, or Louis Beland-Goyette?
A third problem, although one that doesn’t affect game analysis, is the lack of attendance information in any published information. No league announcement appears to include this information. Of a handful of team stories I’ve inspected, only Arizona seems to have released any figure. My hope is that Kenn Tomasch will be able to uncover this data, if the USL decides not to release it.
Scouting the “path to MLS”
The USL is not the only league to suffer these sorts of problems. I’ve come across inconsistent figures in MLS data, although this is usually restricted to third-party data sources. At least the MLS box score design avoids the most egregious of these problems – identifying the starting lineup, and being internally consistent with other published reports. The game data blocks at the bottom of some game reports are not something an analyst can use at scale, because they are basically just very terse words written in a text editor – not drawn from a central data source.
I would really like to include the USL in my analysis efforts this season. The extensive relationships between MLS and USL teams mean that players’ performance in that league offer a path up to the first division. Without a reliable way to retrieve data about what happens in these games, however, it is hard to understand how players are progressing along this “path to MLS”.
Last summer, I published an article that examined the repeatability of success in Major League Soccer’s regular season. Using data from recent seasons, I concluded that “[a]n MLS team’s performance one year has very little relationship to its performance the following year.”
With the 2016 season starting tomorrow, it seems appropriate to build upon that analysis. This piece expands on last summer’s work in two ways. It does this first by adding all seasons of MLS to the data, and second by considering a new measure of success: advancement in the playoffs.
The data used in this article, and the R code to generate the illustrations, has been posted to GitHub. I encourage anyone interested to download the data and build upon it. The data is in CSV format for maximum portability.
Turning first to the league analysis, the plots below now include the final standings from all 20 seasons of Major League Soccer. There are 251 instances in league history of teams playing consecutive seasons*. Plotting all of these in a scatterplot, with first-year performance along the X axis and second-year performance along the Y axis, produces the following figure.
```
> model <- lm(data$PPG2 ~ data$PPG1)
> summary(model)

Call:
lm(formula = data$PPG2 ~ data$PPG1)

Residuals:
    Min      1Q  Median      3Q     Max
-0.9891 -0.1992  0.0165  0.1830  0.7460

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.04390    0.08498  12.283  < 2e-16 ***
data$PPG1    0.24374    0.06118   3.984 8.91e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2972 on 249 degrees of freedom
Multiple R-squared:  0.05992, Adjusted R-squared:  0.05615
F-statistic: 15.87 on 1 and 249 DF,  p-value: 8.907e-05
```
The addition of all historical data does not appreciably change the result. The R-squared coefficient for this model is a very small 0.0599, which is similar to the 0.0681 that was found last summer with a smaller dataset.
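For anyone reproducing the model outside of R: R-squared is simply the share of second-year variance explained by the fitted line. A self-contained Python sketch of that calculation (generic, not run against the article’s dataset):

```python
def r_squared(x, y):
    """R-squared of the ordinary least-squares line y ~ x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual and total sums of squares.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# A perfectly linear relationship gives an R-squared of 1; the MLS
# points-per-game data above gives only about 0.06.
```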
Generally speaking, the conclusion from recent seasons holds up well when all seasons of MLS are examined. Team success in one year is a fairly poor predictor of success the next year. This minimal relationship is unlikely to be a figment of the data, but it also leaves a large amount of variation unexplained.
Turning our attention to the playoffs, the dataset now includes information about how far each team advanced in the postseason. The measure is simple – how far did the team advance? Each team is recorded with one of six values, ranging from “Did Not Qualify” to “Champion”. For some parts of this analysis, numeric equivalents of these six levels are used, ranging from 0 (did not qualify) to 5 (champion). The distribution of these values is depicted below.
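A minimal sketch of that encoding (the exact label strings here are my shorthand for the six levels):

```python
# Numeric equivalents for playoff advancement, 0 through 5.
ADVANCEMENT = {
    "Did Not Qualify": 0,
    "Octofinal": 1,
    "Quarterfinal": 2,
    "Semifinal": 3,
    "Finalist": 4,
    "Champion": 5,
}

def advancement_level(stage):
    """Map a playoff finish label to its 0-5 numeric equivalent."""
    return ADVANCEMENT[stage]
```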
It should be noted, here, that the structure of the playoffs has changed at two different points. After being an eight-team competition for the first 15 years, in 2011 an octofinal round was added that expanded the field to ten teams. In 2015 the competition was expanded again, to include 12 teams. Because of these expansions, the count for teams advancing to the octofinal stage is relatively small.
When we examine how team success in the playoff changes from year to year, some interesting details emerge.
First, teams have demonstrated radical changes in fortune in the playoffs, but not significantly enough to say that past performance is completely meaningless. Teams that missed the playoffs in one year are – slightly more likely than not – going to miss the playoffs the next year. Teams that advanced to the quarterfinals or semifinals are probably going to reach the same general stage the next year (not missing the playoffs, but also not advancing to MLS Cup).
Yet beyond these very general statements, there are some interesting details. Teams that reach MLS Cup are only able to repeat that feat 20% of the time – and are eliminated before the quarterfinals at roughly the same rate. Curiously, no MLS Cup-winning team has ever been eliminated in the semifinals the next year – they all either lose before then, or make a second MLS Cup.
Another interesting phenomenon is how frequently teams have surged to MLS Cup from previous disappointment. The percentage of teams to win the championship after either not making the playoffs, or falling in the quarterfinals, is relatively even – approximately 5%. This is in marked contrast to the difference in these teams’ likelihood of missing the playoffs altogether.
One final quirk relates to the chances of a team’s appearing in MLS Cup. Based on these data points, the group most likely to play in the championship game is not the reigning champion – but the defeated finalist. However, while approximately 25% of losing finalists repeat their appearance the next year, those teams that do repeat are still more likely to lose a second time than to finally claim the title. Blame the New England Revolution and Houston Dynamo for these data points.
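Percentages like these come from tabulating each team’s playoff finish in consecutive seasons. A Python sketch of that tabulation (the sample pairs below are hypothetical, expressed on the 0–5 advancement scale):

```python
from collections import Counter

# Each observation: (finish in year N, finish in year N+1), on the 0-5
# advancement scale. These sample pairs are hypothetical.
pairs = [(5, 2), (5, 5), (4, 4), (4, 0), (0, 0), (0, 5), (2, 2), (2, 3)]

# Raw counts of each year-to-year transition.
transitions = Counter(pairs)

def repeat_rate(pairs, stage):
    """Share of teams finishing at `stage` in year N that repeat it in year N+1."""
    next_finishes = [nxt for prev, nxt in pairs if prev == stage]
    return next_finishes.count(stage) / len(next_finishes)
```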
The varying fates of teams that reached different stages of the playoffs are separated in the following gallery of histograms. Each plot focuses on one level of playoff advancement.
A note on significance
This last note also bears some explanation. For as much as there are now 20 years of history in this data, there are still relatively few data points behind some of these individual categories – so it is still somewhat likely that these are aberrations rather than an emergent trend. This stands in contrast to the linear model for league success, where we can be relatively certain that a team’s performance from one year to the next only explains about 6% of the variation in future performance.
With this analysis now expanded to include all of MLS history, we can say with relative certainty that team fortunes can change drastically from year to year. Yet this raises the question – what does determine team success? Are there factors that predict repeated success? How would we go about identifying them?
Thanks and footnotes
I owe a sincere thanks to Katrin Anacker for helping me work through the analysis of the linear model in this piece, and also to Jason Little and William Rand for their helpful feedback.
* The final seasons for Tampa Bay, Miami, and Chivas USA are obviously excluded. Additionally, I have chosen to exclude the final season of San Jose before their relocation in 2005, classifying Houston as an expansion team.
A few weeks ago I discussed the coding project which I’ve set myself for this offseason – to rebuild as much of my data processing infrastructure as possible. At the time the MLS postseason was still grinding along, however, so I was focused mostly on that rollercoaster.
Now, with MLS Cup complete and the offseason underway – and having had a few days to recover from the emotional turmoil of those first few minutes – my work can begin in earnest.
Let’s talk about Project Trapp