A few weeks ago I discussed the coding project which I’ve set myself for this offseason – to rebuild as much of my data processing infrastructure as possible. At the time the MLS postseason was still grinding along, however, so I was focused mostly on that rollercoaster.
Now, with MLS Cup complete and the offseason underway – and having had a few days to recover from the emotional turmoil of those first few minutes – my work can begin in earnest.
Let’s talk about Project Trapp
Trapp will be a Python package that will replace a good chunk of my data processing pipeline. The exact scope of the platform is something I’m still deciding, but I’m far enough along that I can at least give the highlights.
Trapp, as the code repository describes, will allow the user to link, analyze, and extend soccer-related data. It will operate in three broad categories: importing data, compiling it, and rendering the results.
The heart of my operation is a MySQL database that has been designed and populated over the last two decades. The design is generally set at this point, but it needs to be fed regularly.
Data generally gets injected along two timescales. Around once a year, I import a batch of games – usually in the spring when the league schedules are announced. Subsequent imports are run when the Open Cup and Champions League are formalized. Player records are created piecemeal during the regular season, and in bulk around the beginning of a new tournament.
The more frequent rhythm is that of game records themselves. About once a week, I import the lineups and goals for the games in my targeted competitions.
All told, the importation of data happens in four stages. Each of these is currently shepherded through a range of tools, including web browsers, OpenRefine, and Excel. The final step is a set of VBScript scripts that usually work. At the conclusion of this process, I have a set of database tables that are ready for analysis.
Now, the fun can begin. Once a week during the season, I kick off an hours-long process of compiling the basic data into a format that can be shared and extended. Thankfully this usually happens overnight, so on Sunday evening I start the process, go to bed, and then look over the results on my commute on Monday morning (have I mentioned that I love mass transit?).
The derivative information that emerges from this process includes an updated set of plus/minus figures for every player in the dataset, and a set of records detailing how often various players have appeared together. A byproduct of these calculations is a more traditional-looking stats table for all players, although frankly that information is usually found in a better format elsewhere.
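As a rough illustration of those two calculations, here is a minimal Python sketch. It is not Trapp's actual code: the function names, the input shapes, and the plus/minus definition used here (goals for minus goals against while a player is on the field) are all assumptions made for the example, and the real compilation step may define things differently.

```python
from collections import Counter
from itertools import combinations


def count_pairings(lineups):
    """Count how many games each pair of players appeared in together.

    `lineups` (hypothetical shape) maps a game id to the list of
    player names who appeared in that game.
    """
    pairings = Counter()
    for players in lineups.values():
        # De-duplicate and sort so each pair is always keyed the same way.
        for pair in combinations(sorted(set(players)), 2):
            pairings[pair] += 1
    return pairings


def plus_minus(appearances, goals):
    """Goals for minus goals against while each player was on the field.

    Both arguments (hypothetical shapes) describe a single game:
    `appearances` is a list of (player, team, minute_on, minute_off),
    `goals` is a list of (scoring_team, minute).
    A goal in the exact minute of a substitution is credited to both
    the outgoing and incoming player; that is a simplification.
    """
    totals = Counter()
    for player, team, minute_on, minute_off in appearances:
        for scoring_team, minute in goals:
            if minute_on <= minute <= minute_off:
                totals[player] += 1 if scoring_team == team else -1
    return totals


# Made-up sample data, purely for illustration.
lineups = {"game1": ["Ana", "Bea", "Cal"], "game2": ["Bea", "Ana"]}
pairings = count_pairings(lineups)  # ("Ana", "Bea") appears in both games

appearances = [
    ("Ana", "Home", 0, 90),
    ("Bea", "Home", 0, 60),
    ("Cal", "Home", 61, 90),
    ("Dee", "Away", 0, 90),
]
goals = [("Home", 10), ("Away", 75)]
figures = plus_minus(appearances, goals)  # Bea: +1, Cal: -1, Ana and Dee: 0
```

Both functions loop over every game and every player, which hints at why a full recompilation over a large dataset can take hours; a real implementation would more likely push this work into database queries.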
Once the heavy lifting is over, then the final process happens throughout the next week – making the graphics that I share online, or that allow a more rigorous analysis to proceed.
The formats of these outputs vary – most are Excel spreadsheets, but some tools generate JPG images, JSON data files, or even vector graphics. The languages involved in this step vary widely as well, with VBScript and the odd PHP script joined here by Python, Processing, and R.
This step, frankly, is also the piece which may be least changed through this process. There are a few specific scripts that may need to be rewritten (like the waterfall plot), but if something has to give in this process, the rendering can be tackled later. The import and compilation steps are thornier, and more pressing, challenges.
Progress So Far
Over the past few days, I’ve put together a project scaffolding that shows some promise. There’s a working game importer, and the heart of a player importer as well. I haven’t tested it thoroughly, in that I haven’t actually run an import of real data, but the automated testing is coming back clean.
If you want to look over the code, or try using it yourself, check out Project Trapp on GitHub. It requires Python 2.x, but if that proves problematic for others’ adoption then I’ll look at adding Python 3.x support.
My next steps will be to define a minimum viable product, and to draw up a basic roadmap for achieving that state. My hunch is that this will entail the necessary importers, and at least one each of the compilation and rendering steps.