I was inspired the other day by something Steve Fenn (@StatHunting on Twitter) wrote about analytics in soccer:
IMO it’d be better if a club’s analytics 1st made sure metric was predictive, or at least repeatable.
— Steve Fenn (@StatHunting) July 28, 2015
This question of repeatability is something that resonated with me, so I started digging around a bit. While I can’t claim great familiarity with some of the advanced modeling that goes on around the soccer world, my starting point was a fairly simple question:
How repeatable is team success itself?
This question gets to the issue of parity – which is barely mentioned in many leagues, but is a hallmark of Major League Soccer. Some observers skewer the league for this focus, claiming that it will prevent a truly dominant club from emerging to challenge for CONCACAF titles (or beyond). Others argue that parity is a welcome change from leagues in which only a small number of teams are ever truly contenders.
But what does the data show?
I took the final standings for the last four years in Major League Soccer, and compared how each team did from one year to the next. This timeframe covers the beginning of the 34-game schedule, and provides 56 datapoints over three offseasons (2011-12, 2012-13, and 2013-14). I focused on points per game, but with an equal number of games in each season an analysis of points earned would have had the same result:
This chart shows how team performance changed from year to year over the studied period. A team’s performance in one year determines its placement on the horizontal axis, while its performance the following year determines its vertical position. Teams that achieve exactly the same success from one year to the next would fall on a diagonal line between the lower left and upper right corners of the plot.
As the trendline indicates (and as the equation and R-squared value at upper left document), there is almost no relationship between a team’s performance one year and the next. The R-squared value of 0.0681 falls on a scale from 0 (no relationship at all) to 1 (perfect correspondence).
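The calculation behind these plots is straightforward: pair each team’s points per game in one season with its points per game the next season, then square the Pearson correlation of those pairs. Here is a minimal sketch in Python; the points-per-game values are hypothetical placeholders, not the actual standings used above.

```python
# Year-over-year repeatability check: pair each team's points per game
# in season N with its points per game in season N+1, then compute the
# R-squared of those pairs. Values below are hypothetical placeholders.

def r_squared(xs, ys):
    """Square of the Pearson correlation between paired samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov ** 2 / (var_x * var_y)

# Points per game for the same five teams in consecutive seasons (hypothetical).
ppg_year1 = [1.8, 1.2, 0.9, 1.5, 1.1]
ppg_year2 = [1.1, 1.4, 1.7, 1.0, 1.3]

print(round(r_squared(ppg_year1, ppg_year2), 4))
```

A value near 0 means last season tells you almost nothing about next season; a value near 1 means the standings largely repeat themselves.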
Let me say that again.
An MLS team’s performance one year has very little relationship to its performance the following year.
While arguing from extremes can be risky, the poster child for this phenomenon is DC United from 2013. Ben Olsen’s team earned less than half a point for every game they played that year, and were among the worst-performing teams in league history. Yet the next year they won the Eastern Conference, and had the third-highest point total in the league. You can see that transformation in the lonely dot at the upper left of the plot above.
To help give context to this data, I then looked at final standings for the last several years of the Premier League and La Liga:
These leagues show a significantly stronger relationship between team performance in consecutive years. The R-squared values for these plots both hover around 0.65 – again on a scale of 0 to 1.
While the leagues show some distinguishing characteristics of their own – particularly the relative dominance of Barcelona and Real Madrid in La Liga – they are together worlds apart from MLS in terms of their predictability.
I have not applied this technique to other major leagues such as the Bundesliga or Serie A, nor to more regionally relevant leagues such as Liga MX or Costa Rica’s Primera Division. While that would add context, the contrast between MLS and the best-known European leagues seems well established.
The question at hand is whether the causes of this turnover might be identified – and whether that analysis can inform the choices that a team must make. While it appears to be true that Major League Soccer starts each season with a blank slate, it is also true that some teams – such as the Los Angeles Galaxy – seem to be perennial contenders, while others (the recently folded Chivas USA, or until this year Toronto FC) repeatedly underachieve.
I don’t have these answers yet. Among the factors I would like to investigate, however, are these:
- What role does scheduling play in team performance? MLS plays an unbalanced schedule, which deviates from the practice of many leagues. To what degree does this imbalance influence final standings?
- Player salaries are the subject of regular press releases from the MLS Player Union, and have been the subject of much discussion recently. Great attention has been paid to the narrative of “haves and have nots” in MLS, with teams like Los Angeles and Toronto threatening to outspend their rivals in a salary arms race. What sort of impact can be traced to the presence (or absence) of high-salaried players that take up a significant chunk of a player budget?
- What is the value of roster stability over the span of several years? Is there an ideal amount of turnover to which club leaders should aspire? Is it preferable to identify a core of players and tinker around the edges, or is it better to remake an underperforming roster after every season?
If you have suggestions about these or other long-term topics, I’d love to hear them.
After posting the first set of three plots on Twitter last night, I was encouraged to look at two other leagues: Australia’s A-League, and the modern incarnation of the NASL:
Each is significantly smaller than the other three leagues, so the plots have relatively fewer data points. This is particularly true of the NASL, which has fewer teams and also plays a shorter schedule.
The data show that these leagues have even lower levels of repeatability than does MLS. Each league’s R-squared value is a minuscule 0.01.
These plots raise an interesting question of whether closed leagues naturally produce greater variability in team performance. I’m not ready to embark on that investigation, but would welcome reading about an attempt by someone else.