Sunday, 30 September 2012

Coordinates

Annoyingly, the Opta position coordinates are in units of percent. In the x-direction zero (%) is the defending goal line, and 100 (%) is the opposition goal line. Then, in the y-direction (cross-field) the "near" side is zero (%), and the "far" side is 100 (%).

I can see the reason for using percent rather than a typical distance unit (metres, yards, chains, nanometres): not every pitch has the same dimensions. However, it means you need to be very careful when calculating anything related to distance, because one percent along the pitch is not the same length as one percent across it.
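
Here's a minimal sketch of the conversion I have in mind. The pitch dimensions are an assumption on my part (the data doesn't give them), and the function names are just my own placeholders.

```python
import math

# Assumed pitch dimensions in metres. The data set does not state them,
# so treat these as placeholders to be replaced per stadium.
PITCH_LENGTH_M = 105.0
PITCH_WIDTH_M = 68.0


def percent_to_metres(x_pct, y_pct, length=PITCH_LENGTH_M, width=PITCH_WIDTH_M):
    """Convert percent coordinates to metres.

    x is measured from the defending goal line, y from the "near" touchline.
    """
    return x_pct / 100.0 * length, y_pct / 100.0 * width


def distance_metres(start_pct, end_pct, length=PITCH_LENGTH_M, width=PITCH_WIDTH_M):
    """Straight-line distance in metres between two (x%, y%) points."""
    x0, y0 = percent_to_metres(*start_pct, length=length, width=width)
    x1, y1 = percent_to_metres(*end_pct, length=length, width=width)
    return math.hypot(x1 - x0, y1 - y0)


# The same 10% step gives different distances along and across the pitch.
print(distance_metres((50, 50), (60, 50)))  # 10% along the pitch
print(distance_metres((50, 50), (50, 60)))  # 10% across the pitch
```

The point is to scale each axis separately before applying Pythagoras; taking the distance directly in percent units would treat the pitch as if it were square.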

Anyway, groan....

Saturday, 29 September 2012

Git

One of the most important tools for software engineering is version control (1, 2, 3). Basically, it's a method for keeping track of the changes you make to your work. Apple has recently introduced a similar function in OS X Lion, where the operating system automatically saves each version of a document you are working on.

Version control is a magnificent tool for software engineering because it acts like a backup of your work: change something and you can easily revert to a previous version. It also lets you easily see the changes you are making as a difference between the saved and working versions.
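
To give a feel for what a "difference between saved and working versions" looks like, here's a small sketch using Python's standard difflib module. This is not how any particular version control system works internally; the two "versions" are made-up lines and `show_diff` is just my own helper name.

```python
import difflib

# Two made-up versions of the same tiny script.
saved = ["x = 1", "y = 2", "print(x + y)"]
working = ["x = 1", "y = 3", "print(x + y)"]


def show_diff(saved_lines, working_lines):
    """Return a unified diff between the saved and working versions."""
    return list(
        difflib.unified_diff(
            saved_lines,
            working_lines,
            fromfile="saved",
            tofile="working",
            lineterm="",
        )
    )


for line in show_diff(saved, working):
    print(line)
```

Lines starting with "-" were removed from the saved version and lines starting with "+" were added in the working one. Running `git diff` shows changes in a similar unified format.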

Anyway, the creator of Linux also made a revision control system called "Git" (after himself apparently). This is all the rage now, so it's a useful opportunity to learn it. At work I use an SVN-based system.

Some instructions: http://gitref.org/

Pass-length distribution (First plot!)

Here's the first plot I've made from the Man City data. It's the result of my first play with the data, and is very much hard-wired to this purpose. I'm hoping to write some more general code to make analysis/plotting a bit easier.

Green: successful passes; Red: unsuccessful passes


The plot compares the distributions of pass-lengths between Manchester City and Bolton. To be honest I'm not sure it tells much of a story. Here are some conclusions:

  • Bolton passed the ball less, and completed far fewer 10 to 25 yard passes than Man City;
  • Bolton attempted more long-range passes (40+ yards), and failed to complete them.

There are a few ways this could be progressed:

  • Pass-length distribution for each player; this would not be much more difficult than what I've already done (code-wise). It might pinpoint a particular player who is having a good/poor game.
  • Calculating the distribution over many games, either on a player or team basis, would probably be interesting.
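
For anyone wondering what the binning behind a plot like this might look like, here's a rough sketch. It is not my actual plotting code: the 5-yard bin width, the function name and the example passes are all made up for illustration.

```python
from collections import Counter

BIN_WIDTH_YARDS = 5  # assumed bin width, purely illustrative


def pass_length_histogram(passes, bin_width=BIN_WIDTH_YARDS):
    """Count passes per length bin, split by outcome.

    `passes` is an iterable of (length_in_yards, was_successful) pairs.
    Returns two Counters (successful, unsuccessful) keyed by the lower
    edge of each bin.
    """
    successful, unsuccessful = Counter(), Counter()
    for length, ok in passes:
        bin_start = int(length // bin_width) * bin_width
        if ok:
            successful[bin_start] += 1
        else:
            unsuccessful[bin_start] += 1
    return successful, unsuccessful


# Made-up passes, not taken from the match data.
example = [(12.0, True), (14.5, True), (22.0, False), (41.0, False), (8.0, True)]
good, bad = pass_length_histogram(example)
for start in sorted(set(good) | set(bad)):
    print(f"{start}-{start + BIN_WIDTH_YARDS} yd: ok={good[start]} failed={bad[start]}")
```

Doing this per player or per team would just mean keeping one pair of counters per player or team identifier instead of a single pair.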


Sunday, 23 September 2012

Python XML

First techie post.

The advanced data set is data for a single match (Bolton v Manchester City, 21-08-11), and is in XML format. XML is just a way of organising data, and is used a lot for webpage data and all sorts of other stuff. It's a way of specifying the data; it's not an actual tool for processing it.

I'm going to be writing all my code in Python; it's my current language of choice. There are numerous places that describe and teach Python, so google it. Here's a webpage describing the built-in XML parsing modules in Python. It looks quite useful, but I've not used it in anger yet.
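
As a first guess at how the parsing might go, here's a minimal sketch using the standard library's xml.etree.ElementTree module. The element and attribute names (`Game`, `Event`, `type_id`, `outcome`, `x`, `y`) and the sample values are assumptions for illustration only; the real file layout needs checking before relying on any of this.

```python
import xml.etree.ElementTree as ET

# A made-up sample; the tag and attribute names are assumptions,
# not confirmed against the real data file.
SAMPLE_XML = """<Game>
  <Event type_id="1" outcome="1" x="35.2" y="60.1" />
  <Event type_id="1" outcome="0" x="70.0" y="12.5" />
  <Event type_id="4" outcome="1" x="50.0" y="50.0" />
</Game>"""


def read_events(xml_text, type_id=None):
    """Return events as dictionaries, optionally filtered by event type."""
    root = ET.fromstring(xml_text)
    events = []
    for node in root.iter("Event"):
        if type_id is not None and node.get("type_id") != str(type_id):
            continue
        events.append(
            {
                "type_id": int(node.get("type_id")),
                "successful": node.get("outcome") == "1",
                "x": float(node.get("x")),
                "y": float(node.get("y")),
            }
        )
    return events


# Keep only one (assumed) event type and print its coordinates.
for event in read_events(SAMPLE_XML, type_id=1):
    print(event["x"], event["y"], event["successful"])
```

For a file on disk, `ET.parse(path).getroot()` gives the same kind of root element as `ET.fromstring`.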

Adidas MiCoach

An interesting article appeared in Wired a few weeks ago: Soccer Embraces Big Data to Quantify the Beautiful Game. It discusses the use of the Adidas MiCoach SPEED_CELL in the MLS draft games. The device contains an accelerometer and a GPS (although the commercially available version doesn't have a GPS; I've got one).

Basically, you get a player's speed, and potentially position, displayed on an iPad on the sidelines. You could also derive various real-time statistics; they discuss a "power" metric. They suggest that football would be more popular in America if they could statistically quantify what is going on. Apparently Americans love their stats: RBIs, free-throw percentages, yards gained, and all that.

I'm pretty interested in real-time analysis of player and team performance. I'd be interested to know how much data is used in real-time by managers/coaches; I suspect almost none. I'm not sure how much value it could add to in-match decisions, but making yourself more likely to "win" by a few percent could be advantageous over a whole season.

Man City analytics data

Manchester City have released two datasets containing football data: "Full data" and "Advanced data". Not the most informative names, but I'm sure it makes sense to somebody. The first set (Full data) contains the Opta statistics for every player in every game from the 2011-12 season. This contains details of numerous metrics: assists, goals, headers, tackles, etc.

The second set (Advanced data) contains match events with spatial and temporal coordinates. There are 65 different event types, and each event has various qualifiers. The events include: pass, tackle, goal, cards, start & end of match, and so on.

I'm more interested in the Advanced data, because another list of the most effective (defensive | attacking) player based on grossed up statistics would be a bit dull.

The aim of this blog is to track my progress with analysing the data and the (coding) techniques I'm using. Hopefully I'll manage to post some results too.


If anybody is interested in accessing the data, they should register with Manchester City Analytics.