Run the Experiment

A few years ago I built a system to create social networks by mining online conversations. The idea was that from these networks I could determine (amongst other things) who was talking about a particular topic, and the relative influence of each speaker.

I've been interested in the science of social networks ever since, and so I've really enjoyed re-acquainting myself with the theory and practice of networks during Lada Adamic's excellent Social Network Analysis course.

Then a few weeks back, a question about the practical application of social networks appeared in my Quora feed: "How did Google's creators realise the PageRank algorithm would be a useful indicator of good search results?"

It's a good question, because it goes back to the dawn of the information age, when pioneering thinkers had begun to contemplate how to make sense of the massive amount of knowledge that would inevitably become accessible. Vannevar Bush's classic article As We May Think suggested using a citation index to find influential information. Then in the 50s, Eugene Garfield developed the idea of impact factors, using citations to empirically calculate the influence of academic papers.

But citation analysis is far from trivial to compute, especially when the documents in question have no central index and are literally dispersed across the globe. So the first internet search engines simply counted keywords; it was the pragmatic thing to do.

[Image: One in 100 million: how do you find what you're looking for? (with metaphor by Ai Weiwei)]
It turns out finding things is hard, especially when there's a lot to sift through. And so begins the fascinating story of the PageRank algorithm, and of Google, the company formed to exploit it - a story nicely told in the first few chapters of John Battelle's book, The Search.

It began with a brave, complicated, highly ambitious experiment:

"In graduate school at Stanford University, I had about ten different ideas of things I wanted to do, and one of them was to look at the link structure of the web. My advisor, Terry Winograd, picked that one out and said, 'Well, that one seems like a really good idea.' So I give him credit for that." -- Larry Page

So whilst the idea of citation analysis wasn't new, what was novel was the ambition of applying it to the whole World Wide Web, which by 1996 comprised at least 10 million documents, and was growing rapidly.

In Battelle's book, Terry Winograd recollects that what he didn't say when recommending the topic to Larry Page was that he thought the task practically impossible. But knowing when to hold your peace, and trusting your student to explore a topic without prejudice, is the hallmark of a great mentor.

And Winograd was right not to dampen Page's enthusiasm, because once you start trying to implement citation analysis on a graph as gigantic as the web you're forced to confront all sorts of challenges.

For instance, one non-trivial problem is that the graph you're trying to analyse is enormous, far beyond what would fit into the memory of even the largest supercomputer. And Page and Brin were grad students without budgets at the time; they would have to make do with whatever desktops they could beg and borrow.

But necessity is the mother of invention, and the constraints of their available computing resources became the impetus behind their solution. They would split the graph into smaller, individually computable fragments, distribute each fragment to one of their processors, analyse it, collect the results centrally, and integrate them into an index.
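
To make that idea concrete, here's a minimal sketch in Python of the fragment-and-merge pattern. The toy graph, the function names and the two-way split are all my own invention for illustration - this is not their actual code:

    from collections import defaultdict

    # A toy link graph: page -> pages it links to (invented for illustration).
    graph = {
        'a': ['b', 'c'],
        'b': ['c'],
        'c': ['a'],
        'd': ['c'],
    }

    def split_into_fragments(graph, n):
        """Partition the pages into n roughly equal fragments."""
        pages = sorted(graph)
        return [pages[i::n] for i in range(n)]

    def fragment_contributions(graph, ranks, fragment):
        """Compute the rank each page in a fragment passes along its
        out-links. Fragments are independent of one another, so each
        could be analysed on a separate, modest machine."""
        contribs = defaultdict(float)
        for page in fragment:
            out_links = graph[page]
            for target in out_links:
                contribs[target] += ranks[page] / len(out_links)
        return contribs

    # One distributed pass: farm the fragments out, then merge centrally.
    ranks = {page: 1.0 / len(graph) for page in graph}
    merged = defaultdict(float)
    for fragment in split_into_fragments(graph, n=2):
        for page, value in fragment_contributions(graph, ranks, fragment).items():
            merged[page] += value

The point of the pattern is that each fragment's pass is embarrassingly parallel; only the final merge needs to see everything.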

As well as technical challenges, there were theoretical ones. For example, citations naturally form a directed graph, and because of the way websites are interlinked, it's easy for any analysis algorithm to get stuck in a closed loop. Page's solution was to introduce the concept of the 'Random Surfer', who'd visit a portion of the graph, explore it to build up a local picture of what's important, and then get bored after several clicks and switch to a random page in a completely new region of the network.
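
In the eventual PageRank formulation that 'boredom' becomes the damping factor, conventionally around 0.85: at each step the surfer follows an outgoing link with probability d, and teleports to a random page with probability 1 - d, which guarantees the computation can never get permanently trapped in a loop. Here's a minimal sketch of the resulting iteration, reusing the toy graph above (again an illustration, not the production algorithm - it ignores details like dangling pages):

    def pagerank(graph, damping=0.85, iterations=50):
        """Power iteration with a random-jump term: with probability
        `damping` the surfer follows a link, otherwise they jump to a
        random page, so rank can never get stuck in a closed loop."""
        n = len(graph)
        ranks = {page: 1.0 / n for page in graph}
        for _ in range(iterations):
            ranks = {
                page: (1 - damping) / n + damping * sum(
                    ranks[src] / len(links)
                    for src, links in graph.items() if page in links)
                for page in graph
            }
        return ranks

    ranks = pagerank(graph)
    # 'c' comes out on top: it has the most in-links in the toy graph.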

This divide-and-conquer approach to citation analysis was novel, because no-one had ever had the means or the motivation to analyse such a massive network. The breakthrough was making the problem tractable: creating an infrastructure and algorithm that allowed the network to be explored and understood in small fragments, so the seemingly overwhelming challenge of analysing the web could be divided up. Page and Brin then built a search engine called BackRub to utilise the new PageRank index, and by doing so were able to prove it would scale: the more they processed, the better their results.

Returning to the original question: when did they realise the PageRank algorithm would be a useful indicator of good search results?

"Page and Brin noticed that BackRub's results were superior to those from existing search engines like AltaVista and Excite, which often returned irrelevant listings. 'They were looking only at text and not considering this other signal' Page recalls. That signal is now better known as PageRank. To test whether it worked well in a search application, Brin and Page hacked together a BackRub search tool. It searched only the words in page titles and applied PageRank to sort the results by relevance, but its results were so far superior to the usual search engines - which ranked mostly on keywords - that Page and Brin knew they were onto something big." -- John Battelle

In other words, like all good scientists, Page and Brin knew PageRank would be great after they ran the experiment, saw the results, and saw how exceptional they were.
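
Battelle's description suggests that first experiment could be reproduced in miniature along these lines: match the query against page titles only, then order the matches by PageRank instead of by keyword frequency. This is a hypothetical reconstruction reusing the toy pagerank function above; the titles are invented:

    # Invented titles for the toy pages.
    titles = {
        'a': 'Stanford home page',
        'b': 'Stanford research projects',
        'c': 'Stanford computer science',
        'd': 'Personal homepage',
    }

    def backrub_style_search(query, titles, ranks):
        """Match the query against page titles only, then sort the
        matches by PageRank rather than by keyword counts."""
        matches = [page for page, title in titles.items()
                   if query.lower() in title.lower()]
        return sorted(matches, key=ranks.get, reverse=True)

    print(backrub_style_search('stanford', titles, pagerank(graph)))
    # -> ['c', 'a', 'b']: the best-linked page comes first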

So as well as a fascinating story, there's a marvellous moral in the genesis of Google.
It's not enough to have a great idea.
You need to build it.
You need to prove it.

