Via Paul Lamere, I came across this recent Google blog post on large-scale graph computing. I started reading, and quickly became excited by what I was hearing:
A relatively simple analysis of a standard map (a graph!) can provide the shortest route between two cities. But progressively more sophisticated analysis could be applied to richer information such as speed limits, expected traffic jams, roadworks and even weather conditions. In addition to the shortest route, measured as sheer distance, you could learn about the most scenic route, or the most fuel-efficient one, or the one which has the most rest areas. All these options, and more, can all be extracted from the graph and made useful — provided you have the right tools and inputs.
“Yes!” I thought. “Yes! I am finally starting to see a growing acknowledgement from one of the Search Majors that when you have a goal-oriented topic, such as getting from Point A to Point B, there isn’t just a single, most effective, most efficient route. A user might actually want to choose (explicitly choose, via input tools) different pathways through all the potential waypoints.”
And, by analogy to search results, a user might actually want to choose different ways of scrolling through a set of results other than the single-route, linear-path ranked list, a topic that I’ve gone into in great detail in the past. The author continues:
The web graph is similar. The web contains billions of documents, and that number increases daily. To help you find what you need from that vast amount of information, Google extracts more than 200 signals from the web graph, ranging from the language of a webpage to the number and quality of other pages pointing to it.
“Ok,” I thought, “this is it! Here’s where we’re finally going to get the Google philosophy for providing different ways to traverse from Point A to Point B! They’ve held off on publicly discussing this for years. One almost wonders if they even have a philosophy or strategy for opening up new routes! But they must, right? How very exciting!”
And then it hit:
In order to achieve that, we have created scalable infrastructure, named Pregel, to mine a wide range of graphs. In Pregel, programs are expressed as a sequence of iterations. In each iteration, a vertex can, independently of other vertices, receive messages sent to it in the previous iteration, send messages to other vertices, modify its own and its outgoing edges’ states, and mutate the graph’s topology (experts in parallel processing will recognize that the Bulk Synchronous Parallel Model inspired Pregel).
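For the infrastructure-minded, the iteration model described in that passage can be sketched in a few lines. This is only a toy, single-process illustration of the "receive messages, update state, send messages" loop, applied to shortest paths. It is not Google's Pregel API: the function name `pregel_sssp`, the `edges` dictionary layout, and the `inbox`/`outbox` variables are all assumptions made up for this example, and it assumes non-negative edge weights.

```python
import math
from collections import defaultdict


def pregel_sssp(edges, source):
    """Toy vertex-centric shortest paths (hypothetical helper, not Pregel itself).

    edges: dict mapping each vertex to a list of (neighbor, weight) pairs.
    Assumes non-negative weights; returns dict of vertex -> distance from source.
    """
    # Collect every vertex, including ones that only appear as a neighbor.
    vertices = set(edges)
    for outgoing in edges.values():
        for neighbor, _ in outgoing:
            vertices.add(neighbor)

    dist = {v: math.inf for v in vertices}
    inbox = {source: [0.0]}  # messages delivered at the start of a superstep

    # Each pass of this loop plays the role of one iteration (superstep).
    # The computation halts once no messages are in flight.
    while inbox:
        outbox = defaultdict(list)
        for vertex, messages in inbox.items():
            best = min(messages)  # messages sent in the previous iteration
            if best < dist[vertex]:
                dist[vertex] = best  # the vertex modifies its own state
                for neighbor, weight in edges.get(vertex, []):
                    outbox[neighbor].append(best + weight)  # send messages
        inbox = outbox
    return dist


if __name__ == "__main__":
    toy_graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 2)], "C": []}
    print(pregel_sssp(toy_graph, "A"))
```

Each vertex here looks only at its own incoming messages and its own outgoing edges, which is the "independently of other vertices" part of the description. A real system would run those per-vertex steps in parallel across machines; this sketch runs them in a plain loop.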
Sigh. That’s not what the article was about, after all.
If you’re an infrastructure geek, it’s a good read and I would recommend it. The “thinking like a vertex” spoilers that they give make me believe that they’re taking a very “shallow Markov blanket” sort of approach to all of their machine learning, which is an approach I don’t completely disagree with, having found it interesting enough to dabble with a little. Or maybe they’re doing some sort of loopy belief propagation on a lot of their information structures. Not for me to speculate too much, I suppose.
But if you think it’s just as (if not more) important to discuss the philosophical underpinnings of something as socially, politically, and culturally important as information retrieval, this is not the beginning of that discussion. I was hoping to get more insight into why, with the myriad possibilities opened up by the 200 signals available to the researchers, the focus is essentially still on finding only one route from Point A to Point B. That dialogue will have to wait for another day.
I’m not sure what “essentially still on finding only one route from Point A to Point B” means. Could you talk a little more about the discussion you were hoping to have?
They’re systems guys, not information scientists. PODC and SPAA: you’re not going to hear about exploratory search there. 🙂 Though amazingly you will if you come to this year’s SIGMOD!
@todd: A “route” in this analogy is the order in which the search engine returns results to you. Just as there isn’t only one route from Point A to Point B, so also might I not want to traverse the ranked list in the order that the search engine thinks is best for me. Over on Daniel Tunkelang’s blog, we’re currently discussing our observations about how Bing tends to return popular, how-to, and business-related links toward the top of the list, while Google tends to return academic, scholarly, journal-oriented articles. On Bing, there is only one route through the results: the popular, how-to route. On Google, there is only the academic route. What I want to be able to do is tell the engine, “Hey, no! For this query, find me a different route. I know all my other queries about information seeking and retrieval tend to favor the academic route. But right now I want to learn about the aurora borealis, without having to learn complex technical jargon. So the academic article from the journal of climatology, topically relevant though it may be, is not the particular route that I want to take through the search engine at this moment. So let me choose another route.”
Does that make more sense? The analogy to maps should be clear. Lots of times, you want the shortest-time route. But sometimes, you also want to be able to take the scenic route. The search engine should give you the option of explicitly choosing.
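To make "explicitly choosing a route" a bit more concrete, here is a toy sketch of re-ordering the same result set under a user-selected weighting of signals. Everything in it is an assumption for illustration: the `ROUTES` table, the signal names (`relevance`, `scholarly`, `readability`), the scores, and the result titles are invented, and no real search engine is claimed to work this way.

```python
# Hypothetical "routes": each one is just a different weighting of signals.
ROUTES = {
    "academic": {"relevance": 1.0, "scholarly": 1.0},
    "plain-language": {"relevance": 1.0, "readability": 1.0},
}


def rank(results, route):
    """Order results by the weighted sum of signals for the chosen route."""
    weights = ROUTES[route]

    def score(result):
        signals = result["signals"]
        return sum(w * signals.get(name, 0.0) for name, w in weights.items())

    return sorted(results, key=score, reverse=True)


# Made-up results for an "aurora borealis" query; the numbers are toy values.
RESULTS = [
    {"title": "Climatology journal article",
     "signals": {"relevance": 0.9, "scholarly": 0.9, "readability": 0.2}},
    {"title": "Plain-language explainer",
     "signals": {"relevance": 0.8, "scholarly": 0.1, "readability": 0.9}},
]

if __name__ == "__main__":
    for route in ROUTES:
        print(route, [r["title"] for r in rank(RESULTS, route)])
```

The documents and their signals never change; only the weighting the user picks does, and that alone is enough to put a different result at the top of the list.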
That makes a lot of sense, Jeremy, thanks. I think of it as traversing relationships. Entities relate to other entities in a set of ways and you traverse those relationships depending on what interests you. I even think identity is established by those relationships.
I did not know the different relationship specializations of Bing and Google. That’s interesting and it would be useful to expose those relationships as first class entities.
As far as I know (which I don’t), those different relationship specializations are not something that Google and Bing consciously strive for. I think it is something that unconsciously arises out of unrelated design decisions. But yes, I agree that, whether or not the specializations are intentional, they should be made first-class, and therefore alterable.
@Daniel: Yeah, it doesn’t mention PODC and SPAA until the very end. The buildup at the beginning pointed toward a very different punchline from the one they actually delivered. Oh well.