Sunday, 28 October 2012

Passing networks

My dad sent me a link to an eBook called Objective Barcelona: How To Beat The Most Powerful Team In The World. It was free a few days ago; sorry for not advertising it sooner! It was quite an interesting read, more a very extended essay than a book. It was very editorial, in that it didn't have much data to back up its arguments.

For me the most interesting topic in the book was the discussion of Sergio Busquets. It suggested that cutting off the lines of passing to Busquets from the defence forces Xavi to drop deep to pick up possession. Sometimes they swap positions, which is good for the opposition because Xavi is more dangerous higher up the field. This got me thinking about passing networks.

I've seen some diagrams of passing networks from the MCFC data before:



I'm a big fan of simplicity, so I think a first order attempt should be just plotting the passes. I'm sure plenty of people have done this already, but it's still informative.
Zero order figure. Blue lines show assists (passes leading directly to a goal). All passes, successful and failed, are included here.


My next idea was to start rounding the xy-coordinates to boxes, effectively binning the start and end points. It's not until we start rounding to the nearest 5% (10% in the y direction for scale) that we start to see structure. For the case where the boxes are 10x20% we've probably gone a bit too far for this particular data set.
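As a rough sketch of the rounding step (not the actual code behind the plots), each pass can be snapped to the nearest box and identical start/end pairs counted. The function names and the `(x0, y0, x1, y1)` tuple layout are made up for illustration, and the coordinates are assumed to be in percent.

```python
import math
from collections import Counter

def snap(value, step):
    """Round a percent coordinate to the nearest multiple of `step` (halves round up)."""
    return int(math.floor(value / step + 0.5) * step)

def bin_passes(passes, dx=5, dy=10):
    """Count passes between grid boxes.

    `passes` is an iterable of (x0, y0, x1, y1) tuples in percent (assumed layout).
    """
    counts = Counter()
    for x0, y0, x1, y1 in passes:
        start = (snap(x0, dx), snap(y0, dy))
        end = (snap(x1, dx), snap(y1, dy))
        counts[(start, end)] += 1
    return counts

# Invented example passes, not values from the data set.
passes = [(12.4, 37.0, 48.9, 61.0), (11.0, 42.0, 51.0, 58.0), (80.0, 10.0, 95.0, 50.0)]
binned = bin_passes(passes)
```

The first two passes land in the same pair of boxes, so drawing each line with a width or opacity scaled by its count should make the common routes stand out.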

Rounding to the nearest 5x10% box.
Rounding to the nearest 10x20% box

I think it's fair to say that Bolton play more directly; even in the non-rounded plot you can see the long passes towards the attacking line. It definitely looks like most of the lines run along the left-to-right axis rather than the up-down axis, especially compared to Man City.

This is another diagram which will be more interesting with more data. Hopefully with more data the main lines of passing will stand out. I also think there might be ways to make this more interesting: breaking the pitch into defined boxes, with more detail in midfield; player-by-player analysis; some sort of directional analysis to show if the pass is arriving at or departing from a particular point.

Tuesday, 9 October 2012

Roses

A rose plot is often used in weather science to describe the direction and strength of winds. Since Opta were kind enough to include the angle of each pass (in radians no less) with respect to the direction of play, I thought I'd create some "pass roses". These show the direction and distance of each pass a player made in the game.
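A minimal sketch of the counting behind a rose, assuming the angles are in radians with 0 pointing towards the opposition goal. The function name and the choice of eight sectors are mine, and since I don't know which way positive angles rotate in the data, the sketch makes no left/right claim. It only handles direction; distance would need a second binning along the radius.

```python
import math

def rose_counts(angles, n_sectors=8):
    """Count passes per direction sector.

    Sector 0 is centred on 0 rad (towards the opposition goal); sector
    numbers then increase with angle. Angles are wrapped into [0, 2*pi).
    """
    width = 2 * math.pi / n_sectors
    counts = [0] * n_sectors
    for angle in angles:
        wrapped = angle % (2 * math.pi)
        # Shift by half a sector so each sector is centred on its direction.
        sector = int((wrapped + width / 2) // width) % n_sectors
        counts[sector] += 1
    return counts

# Invented example angles, not values from the data set.
angles = [0.0, 0.1, -0.1, math.pi, math.pi / 2]
counts = rose_counts(angles)
```

The resulting counts can be drawn as bars on a polar axis, one bar per sector.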

I'm not convinced how much you should read into these plots, but that should never stop an armchair football pundit. Can't be worse than the MOTD lot ;). In these plots 0˚ points towards the opposition goal, and 180˚ points to the defending goal.
  • Sergio Agüero mostly passes backwards, which makes sense since he's probably the most advanced player. It looks like most of his failed passes are in the (vague) direction of the goal; 4 out of 6. Also, he doesn't pass the ball more than ~23 yards.
  • Kolarov mostly passes infield (not surprising for a left back; and Richards is similar but opposite).
  • Joe Hart passes short and wide much more often than Jääskeläinen. When going long, both keepers are fairly unsuccessful.
  • Jääskeläinen's rose shows that I've not plotted the (successful) patch quite right, and I don't have time to work out a fix for it. I put the shading in for aesthetic purposes.
Click the images for bigger versions.


I'm not sure there is much future in pass roses; the lack of initial position information makes them a bit superficial. Over the course of the season, I suppose it might show that a player can't reliably pass in a particular direction ;). Pretty unlikely though. Anyway, it was a bit of fun.

Monday, 8 October 2012

More maps...

This time I've done some maps of "duelling". I think that duelling is a bit more difficult to classify. The supporting documentation describes which fields are used for which metrics. For "Ground duels won" it uses: take on, foul, tackle, smother; with the success attribute. For "Ground duels lost" it uses: take on, foul, tackle, challenge, dispossessed; with the non-success attribute. These seem a bit asymmetrical. The aerial duelling is a bit more straightforward: there is an aerial (duel) event, and the success attribute describes the outcome.

To create the maps below I have taken the events: ground duels won, and aerial duels. If the outcome was a success then the position is added to a list for that team; if it was a failure I add it to the opposition team's list. I'm not completely convinced that this is robust (due to the ground duel asymmetry), but there doesn't seem to be any double counting and it gives a general idea of what is going on.
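The allocation rule above might look something like this. It is a sketch only: the dictionary keys (`team`, `x`, `y`, `success`) are hypothetical, and the mirroring of coordinates for the opposition is an assumption about how the positions are stored rather than something the documentation confirms here.

```python
from collections import defaultdict

def duel_positions(events, teams):
    """Credit each duel to the team that won it.

    `events` is a list of dicts with hypothetical keys: team, x, y, success.
    `teams` is a pair of team names. Coordinates are in percent.
    """
    first, second = teams
    opponent = {first: second, second: first}
    won = defaultdict(list)
    for ev in events:
        if ev["success"]:
            won[ev["team"]].append((ev["x"], ev["y"]))
        else:
            # ASSUMPTION: coordinates are relative to the team performing the
            # event, so the point is mirrored when credited to the opponent.
            won[opponent[ev["team"]]].append((100 - ev["x"], 100 - ev["y"]))
    return dict(won)

# Invented example events, not values from the data set.
events = [
    {"team": "Bolton", "x": 30, "y": 40, "success": True},
    {"team": "Bolton", "x": 60, "y": 20, "success": False},
]
positions = duel_positions(events, ("Bolton", "Man City"))
```

If the mirroring assumption is wrong the two teams' maps will look flipped relative to each other, which is exactly the sort of sign error that over-plotting the raw event locations should catch.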

Each time I go to histogram/bin something, I'm surprised by how sparse the data is. As I suggested previously, I'm probably trying to bin the data too finely. If the data for the full season is released then the data volume will probably be better.

Anyway, a final note before the plots: I've over-plotted the location of the events with circles and pluses. This was just to convince myself I'd got the signs correct.





Sunday, 7 October 2012

Interaction maps

I have created some 2D histograms of events for each player, or interaction maps (as I'm going to call them). The events are binned into 5x10% boxes, using "numpy.histogram2d". These are only maps of events involving each player, and don't necessarily show a player's positioning throughout the match. A true positional "heat map" can't be created from the current data set, which is a bit disappointing since I wanted to look at player positions in depth.
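For reference, a minimal sketch of that binning with numpy.histogram2d, assuming the event positions are already available as lists of percent coordinates. The function name and the example points are invented.

```python
import numpy as np

def interaction_map(xs, ys, dx=5, dy=10):
    """2D histogram of event positions in dx-by-dy percent boxes."""
    x_edges = np.arange(0, 100 + dx, dx)  # 0, 5, ..., 100
    y_edges = np.arange(0, 100 + dy, dy)  # 0, 10, ..., 100
    counts, _, _ = np.histogram2d(xs, ys, bins=[x_edges, y_edges])
    return counts

# Invented example positions, not values from the data set.
xs = [12.0, 13.0, 100.0]
ys = [37.0, 31.0, 100.0]
counts = interaction_map(xs, ys)
```

The first index of the returned array runs along the pitch (x) and the second across it (y), so it usually needs transposing before being handed to an image-style plot.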

It's probably a bit hard to draw conclusions from a map of events, except perhaps where a player is positioning himself to be in the game. A few things stand out:
  • Kevin Davies does a lot of stuff between the midfield and defensive lines. I assume he's dropping deep to pick up the ball, or receiving long balls. From the Man City team, it looks like Lescott and Gareth Barry are most likely to be dealing with this threat.
  • The Man City midfield is quite asymmetrical, favouring the left-hand side and relying on Richards to play quite far up the pitch.
Click on the picture to enlarge. The number of events is given in brackets after each player's name.


Wednesday, 3 October 2012

More on pass completion

I've done some more work on the pass completion histograms I made earlier. The first plot shows the pass completion as a percentage for the team, and the second breaks it down to each individual player.

Thanks to Neil for suggesting a percentage; it does look a bit tidier. However, you do lose the magnitude information; for example, a single completed long range pass skews the information a bit. This is exacerbated by the relatively small number of passes (not in terms of a single football match but in terms of sampling, but perhaps I'm using too many bins?). To try and keep a sense of scale I've added the total number of successful and failed passes in the legends.
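To illustrate the skew, here is a small sketch that keeps the raw counts next to the percentage for each pass-length bin. The function name, the 5 yard bin width and the `(length, completed)` pair layout are assumptions for the example, not the code behind the figures.

```python
from collections import defaultdict

def completion_by_length(passes, bin_width=5):
    """Return {bin_start: (completed, attempted, percent)} for pass lengths.

    `passes` is an iterable of (length_in_yards, was_completed) pairs
    (assumed layout).
    """
    tallies = defaultdict(lambda: [0, 0])  # bin_start -> [completed, attempted]
    for length, completed in passes:
        bin_start = int(length // bin_width) * bin_width
        tallies[bin_start][1] += 1
        if completed:
            tallies[bin_start][0] += 1
    return {
        start: (done, tried, 100.0 * done / tried)
        for start, (done, tried) in sorted(tallies.items())
    }

# Invented example passes, not values from the data set.
passes = [(7.0, True), (8.5, True), (9.0, False), (12.0, True), (43.0, True)]
table = completion_by_length(passes)
```

In this made-up example the 40 to 45 yard bin reads 100% on the strength of one pass, while the 5 to 10 yard bin reads about 67% from three, which is why the attempted count is worth keeping alongside the percentage.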

Apologies for the size of the second plot; there's a lot of information there.


Figure 1: Pass completion for the teams

Figure 2: Pass completion for each player. 

Monday, 1 October 2012

Pitches

I wrote a routine to draw a pitch using matplotlib. This is so I can plot the spatial distribution of a variable. This also highlighted the problem with coordinates in units of percent; when I tried to draw the centre circle it came out as an ellipse because I had changed the aspect ratio of the pitch.
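One way round the ellipse problem is to work out the circle's radius separately along and across the pitch before plotting in percent units. This is only a sketch: the centre circle radius of 10 yards is standard, but the 115 x 74 yard pitch is a placeholder I've picked, and the function name is made up.

```python
import math

def circle_in_percent(radius_yd, length_yd, width_yd, n_points=64, centre=(50.0, 50.0)):
    """Points on a circle of `radius_yd` yards, expressed in percent coordinates."""
    rx = 100.0 * radius_yd / length_yd  # radius along the pitch, in percent
    ry = 100.0 * radius_yd / width_yd   # radius across the pitch, in percent
    points = []
    for i in range(n_points):
        theta = 2 * math.pi * i / n_points
        points.append((centre[0] + rx * math.cos(theta),
                       centre[1] + ry * math.sin(theta)))
    return rx, ry, points

# ASSUMPTION: a 115 x 74 yard pitch; the centre circle radius is 10 yards.
rx, ry, outline = circle_in_percent(10, 115, 74)
```

In percent units the outline is deliberately an ellipse (wider in y than in x); it should only look circular once the axes are drawn with the pitch's real proportions.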

Anyway, here is a collection of Andy Warhol inspired pitches...

Sunday, 30 September 2012

Coordinates

Annoyingly, the Opta position coordinates are in units of percent. In the x-direction zero (%) is the defending goal line, and 100 (%) is the opposition goal line. Then, in the y-direction (cross-field) the "near" side is zero (%), and the "far" side is 100 (%).

I can see the reason for using percent rather than a typical distance unit (metres, yards, chains, nanometres): not every pitch has the same dimensions. However, it means you need to be very careful when calculating anything related to distance.
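A minimal sketch of the kind of care needed: scale each axis by the pitch dimension before applying Pythagoras. The default pitch size here is a placeholder of mine, not a value from Opta, and the function name is invented.

```python
import math

def distance_yards(x0, y0, x1, y1, length_yd=115.0, width_yd=74.0):
    """Distance between two percent-coordinate points, in yards.

    ASSUMPTION: the default pitch dimensions are placeholders, not Opta's values.
    """
    dx = (x1 - x0) * length_yd / 100.0  # along the pitch
    dy = (y1 - y0) * width_yd / 100.0   # across the pitch
    return math.hypot(dx, dy)

# On a hypothetical 100 x 50 yard pitch this is a 30-40-50 triangle.
d = distance_yards(0, 0, 30, 80, length_yd=100.0, width_yd=50.0)
```

With the default dimensions a 10% step is 11.5 yards along the pitch but only 7.4 yards across it, so taking the square root of the raw percent differences would overstate cross-field passes.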

Anyway, groan....

Saturday, 29 September 2012

Git

One of the most important tools for software engineering is version control (1, 2, 3). Basically it's a method for keeping track of the changes you make to your work. Apple has recently introduced a similar function in OS X Lion, where the operating system automatically saves each version of a document you are working on.

Version control is a magnificent tool for software engineering because it acts like a backup of your work: change something and you can easily revert to a previous version. It also lets you easily see the changes you are making as a difference between the saved and working versions.

Anyway, the creator of Linux also made a revision control system called "Git" (after himself apparently). This is all the rage now, so it's a useful opportunity to learn it. At work I use an SVN based system.

Some instructions: http://gitref.org/

Pass-length distribution (First plot!)

Here's the first plot I've made from the Man City data. It's the result of my first play with the data, and is very much hard-wired to this purpose. I'm hoping to write some more general code to make analysis/plotting a bit easier.

Green: successful passes; Red: unsuccessful passes


The plot compares the distributions of pass-lengths between Manchester City and Bolton. To be honest I'm not sure it tells much of a story. Here are some conclusions:

  • Bolton passed the ball less, and completed far fewer 10 to 25 yard passes than Man City;
  • Bolton attempted more long range passes (40+ yards), and failed to complete them.
There are a few ways this could be progressed:
  • Pass-length distribution for each player; this would not be much more difficult than what I've already done (code wise). It might pinpoint a particular player who is having a good/poor game. 
  • Calculating the distribution over many games, either on a player or team basis would probably be interesting.


Sunday, 23 September 2012

Python XML

First techie post.

The advanced data set is data for a single match (Bolton v Manchester City, 21-08-11), and is in XML format. XML is just a way of organising the data and is used a lot for webpage data and all sorts of other stuff. It's a way of specifying the data; it's not an actual tool for processing the data.

I'm going to be writing all my code in Python; it's my current language of choice. There are numerous places that describe and teach Python, so google it. Here's a webpage describing the built-in XML parsing modules in Python. It looks quite useful, but I've not used it in anger yet.
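As a first taste, here is a small sketch using the standard library's xml.etree.ElementTree module. The XML snippet is made up for illustration: the tag and attribute names (`Game`, `Event`, `type`, `x`, `y`, `outcome`) are my guesses at a plausible layout, not the real schema of the data file.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: made-up XML for illustration; the real file's tags and
# attributes may well differ.
SAMPLE = """
<Game home="Bolton" away="Manchester City">
  <Event type="pass" x="35.2" y="48.0" outcome="1"/>
  <Event type="pass" x="61.7" y="12.5" outcome="0"/>
  <Event type="tackle" x="22.0" y="70.1" outcome="1"/>
</Game>
"""

def read_events(xml_text, event_type):
    """Return (x, y, success) tuples for events of one type."""
    root = ET.fromstring(xml_text)
    rows = []
    for ev in root.iter("Event"):
        if ev.get("type") == event_type:
            rows.append((float(ev.get("x")), float(ev.get("y")),
                         ev.get("outcome") == "1"))
    return rows

passes = read_events(SAMPLE, "pass")
```

For a file on disk, `ET.parse(path).getroot()` gives the same kind of root element as `ET.fromstring`.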

Adidas MiCoach

An interesting article appeared in Wired a few weeks ago: Soccer Embraces Big Data to Quantify the Beautiful Game. It discusses the use of the Adidas MiCoach SPEED_CELL in the MLS draft games. The device contains an accelerometer and a GPS (although the commercially available version doesn't have a GPS; I've got one).

Basically, you get a player's speed, and potentially position, displayed on an iPad on the sidelines. You could also derive various real-time statistics; they discuss a "power" metric. They suggest that football would be more popular in America if they could statistically quantify what is going on. Apparently Americans love their stats: RBIs, free-throw percentages, yards gained, and all that.

I'm pretty interested in real-time analysis of player and team performance. I'd be interested to know how much data is used in real time by managers/coaches; I suspect almost none. I'm not sure how much value it could add to in-match decisions, but making yourself more likely to "win" by a few percent could be advantageous over a whole season.

Man City analytics data

Manchester City have released two datasets containing football data: "Full data" and "Advanced data". Not the most informative names, but I'm sure it makes sense to somebody. The first set (Full data) contains the Opta statistics for every player in every game from the 2011-12 season. This contains details of numerous metrics: assists, goals, headers, tackles, etc.

The second set (Advanced data) contains match events with spatial and temporal coordinates. There are 65 different event types, and each event has various qualifiers. The events include: pass, tackle, goal, cards, start & end of match, and so on.

I'm more interested in the Advanced data, because another list of the most effective (defensive | attacking) player based on grossed up statistics would be a bit dull.

The aim of this blog is to track my progress with analysing the data and the (coding) techniques I'm using. Hopefully I'll manage to post some results too.


If anybody is interested in accessing the data, they should register with Manchester City Analytics.