Large data can be extremely effective, but how widely applicable is it, really?
A week or two ago the blogosphere was abuzz with discussion about the Unreasonable Effectiveness of Data position paper by Googlers A. Halevy, P. Norvig, and F. Pereira. I had my own commentary, but some great discussion came when Peter Norvig jumped into the comments section of Daniel’s blog and clarified some of his points. I have decided to write a few follow-up posts on this topic, as it touches all sorts of information-seeking behaviors and domains, from music recommendation to web search to enterprise search to exploratory search.
And what is the topic, exactly? It is the idea that you can rely on vast quantities of data to solve problems that would otherwise require a much more complicated algorithm. By worrying first about gathering more data, and only then about creating fancy algorithms, you make a problem much easier to solve. I generally agree that this approach works for a few choice problems and topics: the small head of sub-problems on the small head of main problems. But my main concerns with relying on large-data solutions as one’s primary approach to research are that:
- There is a larger number (a longer tail) of problems for which this approach will never work, because there will never be enough data available, and
- There is a larger number (a longer tail) of sub-problems within a main problem for which this approach will never work. For example, home page finding (navigation) is but one of many types of web search. Many web searchers’ needs are informational rather than navigational in nature.
For example, PageRank is a large-data method that exploits the web’s vast graph of HTML in-links (search engines also use the anchor text of those in-links as a complementary signal) in order to demote less popular and spammy web pages. And it works dandy for web searchers doing home page finding. (Web search is a “head” main problem, and home page finding, aka navigation, is a “head” sub-problem of that main problem.) But:
- How well does PageRank work for the hundreds of thousands, if not millions, of smaller collections that people search all the time, e.g. enterprise and desktop search? It works on the web because there is enough link data. But does it work in the enterprise and on the desktop? Interestingly, the total volume of information being searched by people across the world in desktop and enterprise settings likely adds up to more information than is found on the web, but everyone’s desktop and enterprise documents sit in their own silos, do not link to each other, and likely never will. So there is still a large problem (lots of information to search), but large amounts of in-link data will never be available. And,
- How well does it work for informational queries at web scale? Unlike enterprise and desktop search, where you will never have enough in-link data to make PageRank work, on the web you do have that data. However, the problem itself is now different. Informational needs are very different from navigational needs: the latter usually require a single relevant answer, while the former require dozens, if not hundreds. So does the large-data approach of PageRank make it easier or more difficult to find information on a topic when doing web search?
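To make the in-link dependence concrete, here is a minimal sketch of the PageRank computation (power iteration with damping) over a tiny hypothetical link graph. The page names and link structure are invented purely for illustration, and this is a toy version, not Google's production algorithm. The point to notice: a page's score comes entirely from who links to it, so in a siloed collection with few or no cross-document links, every page collapses toward the same uniform teleport score and the ranking signal vanishes.

```python
# Toy PageRank via power iteration (damping factor 0.85, as in the
# original formulation). Purely illustrative; the graph is hypothetical.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start uniform
    for _ in range(iterations):
        # Every page gets a baseline "teleport" share of (1 - d) / n.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page passes its rank, damped, equally to its outlinks.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank uniformly over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank

# "home" is heavily in-linked, as a popular home page would be on the web.
web = {
    "home":  ["about"],
    "about": ["home"],
    "blog":  ["home", "about"],
    "faq":   ["home"],
}
ranks = pagerank(web)
print(max(ranks, key=ranks.get))  # the well-linked page wins: home
```

If you instead feed in a graph with no links at all (every value an empty list), as in a desktop full of unlinked documents, every page ends up with the identical score 1/n, which is exactly the long-tail problem described above.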
I will begin to address some of these questions using more concrete examples and pointers to research done by others. My goal is not to prove myself right or others wrong, but rather to point out that there are shades of grey and potential pitfalls in relying solely on large-data methods, pitfalls that at the very least deserve more awareness and discussion.