What You Can Find Out

The Edge has published their annual question for 2010:


As an Information Retrieval research scientist, I of course was quite interested in what search folks had to say.  I found this blurb from Marissa Mayer intriguing:

It’s not what you know, it’s what you can find out. The Internet has put at the forefront resourcefulness and critical-thinking and relegated memorization of rote facts to mental exercise or enjoyment. Because of the abundance of information and this new emphasis on resourcefulness, the Internet creates a sense that anything is knowable or findable — as long as you can construct the right search, find the right tool, or connect to the right people. The Internet empowers better decision-making and a more efficient use of time…

The Web has also enabled amazing dynamic visualizations, where an ideal presentation of information is constructed — a table of comparisons or a data-enhanced map, for example. These visualizations — be it news from around the world displayed on a globe or a sortable table of airfares — can greatly enhance our understanding of the world or our sense of opportunity. We can understand in an instant what would have taken months to create just a few short years ago. Yet, the Internet’s lack of structure means that it is not possible to construct these types of visualizations over any or all data. To achieve true automated, general understanding and visualization, we will need much better machine learning, entity extraction, and semantics capable of operating at vast scale.

It sounds like there is an increased awareness of (and respect for) Exploratory Search.  I’ve heard this via private channels, but this is the first time I’ve seen an acknowledgment of the need for more exploratory search from such an official channel.

I do want to point out, however, that in order to make this work at web scale, we won’t just need better automated methods.  That is, we cannot rely solely on machine learning, entity extraction, or web-scale semantics.  Rather, what is also desperately needed is a way for the user him- or herself to inject personal semantics and structure into the search, visualization, and comparison process.  The search engine itself needs to be responsive to the structure that the user gives it, and to rearrange itself around that structure.

I am afraid that I am not being very clear in the vision that I’m attempting to lay out, so let me draw an analogy to parametric and non-parametric statistical modeling.  In parametric modeling, you assume that your data is distributed according to some function (say, Gaussian) and then you try to find the parameters that best fit the data.  With non-parametric modeling, on the other hand, you make no such assumption.  You simply let the data describe itself through its own correlations and patterns.
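
To make the statistical half of the analogy concrete, here is a minimal sketch in Python. It fits the same small sample two ways: once by assuming a Gaussian and estimating its mean and standard deviation, and once with a kernel density estimate, which simply places a small bump on every data point. The sample values, the function names, and the bandwidth of 0.5 are all made up for illustration.

```python
import math
import statistics

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std dev sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def fit_parametric(data):
    """Parametric: assume a single Gaussian, then estimate its two parameters."""
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)
    return lambda x: gaussian_pdf(x, mu, sigma)

def fit_nonparametric(data, bandwidth=0.5):
    """Non-parametric: kernel density estimate, one small bump per data point."""
    n = len(data)
    return lambda x: sum(gaussian_pdf(x, xi, bandwidth) for xi in data) / n

# Made-up sample with two clusters, one near -2 and one near +2.
data = [-2.1, -1.9, -2.0, -2.2, -1.8, 2.0, 1.9, 2.1, 2.2, 1.8]

parametric = fit_parametric(data)
nonparametric = fit_nonparametric(data)

for x in (-2.0, 0.0, 2.0):
    print(f"x={x:+.1f}  parametric={parametric(x):.3f}  "
          f"nonparametric={nonparametric(x):.3f}")
```

On this sample the single Gaussian puts its peak at zero, where there is no data at all, while the kernel estimate is high near the two clusters and close to zero in between. The assumed functional form helps only when the data actually has that form.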

By analogy: Assuming that the only way to visualize and compare information (do exploratory search) on the web is to rely on machine learning to do entity extraction and web-scale semantics is like assuming that one has to have a parametric model.  It helps, but it is not absolutely necessary.  My vision is for another approach, one analogous to non-parametric methods: Let the user give feedback on the relationship between items that he or she has examined during the search process, and then use that comparison information to build a personalized visualization or comparison tool for that user’s specific information need, from the ground up.  Don’t rely on the parametric form of semantic categories or named entities.  Use bottom-up patterns to facilitate organization and comparison, discovery and learning, decision making and exploration.  More importantly, use the feedback provided by the user (e.g. “these two items are similar”, and “these two items are not”) to drive your online, bottom-up exploration.
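
One way such a feedback loop might look in code is sketched below. This is only an illustration under stated assumptions, not a description of any existing system: the items, their two numeric features (price and screen size), and the function names are hypothetical, and the reweighting rule is a deliberately simple heuristic. Features that differ across the pairs the user called dissimilar get more weight; features that differ across the pairs the user called similar get less.

```python
def learn_weights(items, similar, dissimilar, eps=1e-6):
    """Per-feature weight: spread across dissimilar pairs divided by
    spread across similar pairs (eps avoids division by zero)."""
    dims = len(next(iter(items.values())))

    def mean_sq_diff(pairs, d):
        if not pairs:
            return 0.0
        return sum((items[a][d] - items[b][d]) ** 2 for a, b in pairs) / len(pairs)

    return [(mean_sq_diff(dissimilar, d) + eps) / (mean_sq_diff(similar, d) + eps)
            for d in range(dims)]

def distance(u, v, weights):
    """Weighted Euclidean distance between two feature tuples."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, u, v)) ** 0.5

def neighbors(query, items, weights):
    """All other items, ordered from nearest to farthest from the query."""
    return sorted((k for k in items if k != query),
                  key=lambda k: distance(items[query], items[k], weights))

# Hypothetical search results as (price, screen size), both scaled to [0, 1].
items = {
    "A": (0.10, 0.90),
    "B": (0.90, 0.85),
    "C": (0.15, 0.20),
    "D": (0.85, 0.92),
    "E": (0.20, 0.10),
}

# Explicit user feedback gathered during the session.
similar = [("A", "B")]     # "these two items are similar"
dissimilar = [("A", "C")]  # "these two items are not"

weights = learn_weights(items, similar, dissimilar)
print("before feedback:", neighbors("A", items, [1.0, 1.0]))
print("after feedback: ", neighbors("A", items, weights))
```

With equal weights, C is the nearest neighbor of A because the two are close in price. After the two pieces of feedback, the ordering reorganizes around screen size, and D and B move to the top. No category or entity label was needed; the structure came from the user's own comparisons.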

We have to get away from this attempt to solve the exploration problem ahead of time, off-line, before the user has ever issued a query.  That’s the parametric way of thinking, the way that presumes that categories and labels and entities are the best way of tackling organization and discovery.  Rather, we have to become better at involving the user, the person doing the exploration, in the feedback loop, and not rely solely on pre-computed, machine-learning-extracted entities.

Unlike navigational search, in which users are rarely willing to do any extra work themselves, users engaged in exploratory search by their very nature desire to interact more with the system and put more of their own sweat and tears into the search process.  They would not be exploring, if they weren’t.  So why not make use of this user willingness?

Computational resources are going to be a challenge.  But that’s where Google’s new commitment to openness (and Yahoo!’s initial, existing commitment) comes in handy.  There should be a willingness to offload some of the computation (and therefore also the search data itself) to the user’s own computer.  Instead of SETI@Home, we could have SEARCH@Home.  Let the user’s underutilized processing power be partially responsible for computing some of these bottom-up patterns in his or her own search data that will help make dynamic visualization a reality.  Make the user’s own computer partially responsible for the additional necessary processing.

Mayer is correct: “The Internet has put at the forefront resourcefulness and critical-thinking and relegated memorization of rote facts to mental exercise or enjoyment. Because of the abundance of information and this new emphasis on resourcefulness, the Internet creates a sense that anything is knowable or findable — as long as you can construct the right search, find the right tool, or connect to the right people.”  We should be developing systems that enable users to construct the right search. The user should be able to rely on her own resourcefulness to mash up and explore the data herself, to shed light on patterns of information hitherto unknowable by single-line input box navigational search.  Users should be able to apply critical thinking to their search process in a way that makes sense to them, not in a way that has been pre-computed through some semantic category and machine learning classifier.  And a good search engine should be a valuable partner in this process, by way of flexibility and openness, not by way of constraint and closedness.

Only then will we, the users of these systems, be able to find out what we previously could not find out.  At least, that is how the Internet is changing the way I think.

6 Responses to What You Can Find Out

  1. I actually think things have always been this way, at least since libraries. It’s why I never liked closed book exams when I was a professor (or student) — it just wasn’t how you really went about solving a problem.

    The social side of work is also important. I let students get any help they wanted on homework as long as they cited it. Because that’s how the real world works, at least in academia.

    Just trolling the web isn’t enough. There’s a key social aspect to the kind of knowledge you pick up standing in line for coffee at a conference or in the lunch room at work. Blogs are adding a bit more of this to the web.

    How about a game of Jeopardy where you can use the web? That’s how I’d play it if I had a class to teach how to search.

  2. jeremy says:

    But isn’t Jeopardy-search still known-item, fact lookup search?

    I’m interested more in open-ended questions. Exploratory information needs, where the goal is to understand the relationship between different pieces of knowledge, just as much as it is finding those pieces of knowledge in the first place.

    Right now, there seems to be a feeling that the only way we can understand the relationship between different pieces of knowledge is to be able to classify the data into some sort of taxonomy (or even folksonomy, I don’t care) and then use those semantics to form the basis of our comparisons and organization. That’s the parametric approach, where your classes form the functional structure around which you organize.

    I would like to propose a research agenda in which we turn this around. Instead of relying on pre-existing structure, or on our ability to do entity extraction to create that structure, why not let the user sort out the information he or she finds? As the user is doing the sorting, giving explicit feedback on what data is related and what is not related, the search engine can non-parametrically start to find information that follows the same user-defined similarity/dissimilarity structure.

    Don’t pre-compute the structure. Let the user grow the structure organically, as the process evolves.

    Maybe that is the way we’ve always done things. But it is not the way that online information seeking systems are set up. Time to bring ’em back into traditional user behavior models?

  3. Your post is one of the most interesting (and clear) I’ve read in a while, as the various threads are tied together. Both Mayer and you refer to the challenges of operating at vast scale (i.e. “… capable of operating at vast scale.”, “Computational resources are going to be a challenge.”), and it may be obvious what is meant, but what do you mean by it?

  4. Pingback: Weekly Search & Social News: 01/19/2010 | Search Engine Journal

  5. jeremy says:

    @Dinesh: What I mean by it will be a paper submission to CIKM 2010. I’ll let you know if it gets in 🙂

  6. Pingback: Information Retrieval Gupf » Close the Loop!