The Social Network Analysis of Football Matches

  • "Football is a simple game based on the giving and taking of passes, of controlling the ball and of making yourself available to receive a pass. It is terribly simple." -- Bill Shankly
  • "Football is simple. But nothing is more difficult than playing simple football." -- Johan Cruyff

Did you know that several academics have used network analysis in an attempt to derive insights into the beautiful game?

During the 2010 World Cup, FIFA analysts compiled data listing every pass exchanged between players, and put it online (unfortunately it seems to have disappeared since then). This data was used by researchers at Queen Mary University of London to create a network theory explanation of how each team played. There's also a media-friendly summary of their work, and a more detailed write-up in the following paper:  

And the QMUL group aren’t the only researchers seeking to apply network analysis to football, there’s also the long-running ARSFutbol research group at the university of Buenos Aires. Some of their work looks very interesting, it's in Spanish, but Google Translate will help you read it in your own language.

I also found another paper analysing the 2010 World Cup from some Spanish researchers:

As this BBC article explains, most big clubs now have a team of data analysts, and companies like Prozone produce detailed data on every aspect of what happens on the pitch. Unfortunately clubs pay a lot of money for Prozone data, so it's unlikely to be publicly available; so if you're hoping to do some bedroom-based analysis, you might be out of luck.

Not all are convinced of the value of this type of analysis, mind you. Tottenham manager Andre Villas-Boas said: “I have never used Prozone. I don't use it because I don't believe in it". Whereas other managers, like Rafa Benitez, are more keen on the data-driven approach.

That network theory might be applied to football is not particularly surprising. After all, football is a passing game, and interactions between players are fundamental to a successful team. Hence when we fans watch players playing well, we frequently call them “influential”. And network theory has proven to be especially good in determining who or what is influential amongst the millions of documents now online.

By comparison with the Byzantine complexity of online content, analysing what happens on a football pitch seems child's play. The graph of a football match will only have 22 players (nodes), a handful more if you include substitutes.

However, to paraphrase Cruyff: calculating the numbers is simple, but deriving useful insights from them - that's difficult. For instance, say you notice a lot of passes are going through your centre-forward (he has high betweenness in the network jargon), is that a good thing? Was that part of your team's plan? Or should you be yelling at your centre-forward to get into the bloody box and to start causing some fecking mayhem?
Does network analysis have any value to our understanding of football? Will it one day be able to quantify what football professionals have come to understand intuitively - demystifying it for the rest of us? Or perhaps the spirit of football will always defy analysis, generating as many opinions as there are fans watching it. After all, what model could ever hope to explain the Miracle of Istanbul...

Run the Experiment

A few years ago I built a system to create social networks by mining online conversations. The idea was that from these networks I could determine (amongst other things) who was talking about a particular topic, and the relative influence of each speaker.

I've been interested in the science of social networks ever since, and so I've really enjoyed re-acquainting myself with the theory and practice of networks during Lada Adamic's excellent Social Network Analysis course.

Then a few weeks back, a question about the practical application of social networks appeared in my Quora feed: "How did Google's creators realise the PageRank algorithm would be a useful indicator of good search results?"

It's a good question, because it goes back to the dawn of the information age, when pioneering thinkers had begun to contemplate how to make sense of the massive amount of knowledge that would inevitably become accessible. Vannevar Bush's classic article As We May Think suggested using a citation index to find influential information. Then in the 50s, Eugene Garfield developed the idea of impact factors, using citations to empirically calculate the influence of academic papers.

But citation analysis is not trivial to calculate, especially if the documents in question don't even have an index, and are literally dispersed across the globe. So the first internet search engines simply counted keywords; it was the pragmatic thing to do.

  • One in 100 million: how do you find what you're looking for? (with metaphor by Ai Weiwei)
It turns out finding things is hard, especially when there's a lot to sift through. And so here begins the fascinating story of the PageRank algorithm - and Google - the company formed to exploit it, which is nicely told in the first few chapters of John Battelle's book, The Search.

It began with a brave, complicated, highly ambitious experiment:

"In graduate school at Stanford University, I had about ten different ideas of things I wanted to do, and one of them was to look at the link structure of the web. My advisor, Terry Winograd, picked that one out and said, 'Well, that one seems like a really good idea.' So I give him credit for that." -- Larry Page

So whilst the idea of citation analysis wasn't new, what was novel was the ambition of applying it to the whole worldwide web, which by 1996 was at least 10 million documents, and growing rapidly.

In Battelle's book Terry Winograd recollects that what he didn't say when recommending the topic to Larry Page was that he thought the task was impossible in practice. But knowing when to hold your peace, and trust your student to explore the topic without prejudice, is the hallmark of a great mentor.

And Winograd was right not to dampen Page's enthusiasm, because once you start trying to implement citation analysis on a graph as gigantic as the web you're forced to confront all sorts of challenges.

For instance: one non-trivial problem is the graph you're trying to analyse is enormous, far beyond what would fit into the memory of even the largest supercomputer. And Page and Brin were grad students without budgets at the time, they would have to make do with desktops they could beg and borrow.

But necessity is the mother of invention, and the constraints of their available computing resources became the impetus behind their solution. They would split the graph into smaller individually computable fragments. They would distribute each fragment to one of their processors, analyse it, collect the results centrally, and integrate them into an index.

As well as technical challenges, there were theoretical ones. For example, by nature citations form a directed graph, and because of the way websites are interlinked, it's easy for any analysis algorithm to get stuck in a closed loop.  Page's solution was to introduce the concept of the 'Random Surfer', who'd visit a portion of a graph, explore it to build up a local picture of what's important, and then get bored after several clicks and switch to a random page in a completely new region of the network.

This approach to computing citation analysis using a divide and conquer approach was novel, because no-one had ever had the means or the motivation to analyse such a massive network. The breakthrough was making the problem tractable, creating an infrastructure and algorithm that allowed the network to be explored and understood in small fragments. This allowed the seemingly overwhelming challenge of analysing the web to be divided up. Then Page and Brin built a search engine called BackRub to utilise the new PageRank index, and by doing so were able to prove it would scale: the more they processed, the better their results.

Returning to the original question; when did they realize the PageRank algorithm would be a useful indicator of good search results?

"Page and Brin noticed that BackRub's results were superior to those from existing search engines like AltaVista and Excite, which often returned irrelevant listings. 'They were looking only at text and not considering this other signal' Page recalls. That signal is now better known as PageRank. To test whether it worked well in a search application, Brin and Page hacked together a BackRub search tool. It searched only the words in page titles and applied PageRank to sort the results by relevance, but its results were so far superior to the usual search engines - which ranked mostly on keywords - that Page and Brin knew they were onto something big." -- John Battelle

In other words, like all good scientists, Page and Brin knew PageRank would be great after they ran the experiment, saw the results, and saw how exceptional they were.

So as well as a fascinating story, there's a marvellous moral in the genesis of Google.
It's not enough to have a great idea.
You need to build it.
You need to prove it.

Twitter FeedRecent Twitterings

    Do we share an interest?

    c o n t a c t m e @ j a r o n c o l l i s . c o m