Daniel T. has an interesting bipartite use-case model for exploratory search:
- I know what I want, but I don’t know how to describe it.
- I don’t know what I want, but I hope to figure it out once I see what’s out there.
Perhaps this is a silly analogy, but framing the problem in this way reminded me abstractly of P vs. NP. Some problems can be both solved and verified in polynomial (P) time. Other problems can be verified in P time, but it is unknown whether a P-time solution exists; these make up the non-deterministic polynomial (NP) set of problems. In the worst case, it might take exponential time to get the answer.
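Subset sum makes the asymmetry concrete: checking a proposed answer is fast, but finding one by brute force means trying up to 2^n subsets. A toy sketch (function names and numbers are my own, purely for illustration):

```python
from itertools import combinations

def verify(certificate, target):
    """Checking a proposed answer: polynomial (here, linear) time."""
    return sum(certificate) == target

def search(numbers, target):
    """Finding an answer: brute force over all 2^n subsets."""
    for r in range(len(numbers) + 1):
        for subset in combinations(numbers, r):
            if verify(subset, target):
                return subset
    return None

solution = search([3, 34, 4, 12, 5, 2], 9)
print(solution)             # (4, 5)
print(verify(solution, 9))  # True -- verification is cheap either way
```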
Google and related web search engines are lookup, navigational, known item engines. You can both obtain and verify your answer in polynomial time. Linear even (hence the classic ranked list).
With an exploratory information need, the satisfaction of your information need can be verified in polynomial time. It doesn’t take too long to examine the assembled set of summarized/contrasted/accumulated information and tell whether or not your information need has been satisfied. Maybe it doesn’t take constant time, but it certainly can be accomplished in linear time. But accumulating that information in the first place? It is generally unknown how long that will take, as there is a bit of non-determinism in the information seeking pathways that you need to traverse. Exploratory search is, dare I say, NP.
So the big question: Is P = NP? That is to say, can one take a tool such as Yahoo! or Google, which has been generally optimized for lookup, P-time problems, and use it to satisfy an exploratory information seeking task? Certainly one can try to use these tools in this manner. Nothing stops a user from entering vast quantities of queries and accumulating the necessary set of information themselves. But the tool has not been designed for that purpose. So can it really be used to solve that problem? Are multiple iterations of lookup search capable of satisfying an exploratory information need? Does P = NP?
I don’t think that it does.
Question for the day: For certain classes of NP problems (e.g. the knapsack problem) there are often heuristics that yield good approximate (nearly-optimal) solutions in P-time. What are the analogous classes of problems in the exploratory information seeking domain? And how would we, in general, recognize them?
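As a reminder of what such a heuristic looks like, here is the classic greedy approximation for the 0/1 knapsack: sort items by value density and take them while they fit. It runs in O(n log n) and is often close to optimal, though not guaranteed (the item names and numbers below are invented):

```python
def greedy_knapsack(items, capacity):
    """Greedy approximation: take items in order of value/weight density.
    Runs in O(n log n); not optimal, but often close."""
    chosen, total_value, remaining = [], 0, capacity
    for name, value, weight in sorted(items, key=lambda i: i[1] / i[2],
                                      reverse=True):
        if weight <= remaining:
            chosen.append(name)
            total_value += value
            remaining -= weight
    return chosen, total_value

items = [("a", 60, 10), ("b", 100, 20), ("c", 120, 30)]
print(greedy_knapsack(items, 50))  # (['a', 'b'], 160)
```

On this instance the greedy answer is worth 160, while the optimal choice (b and c) is worth 220: a near-optimal answer in P-time, traded against any guarantee of optimality. That is exactly the kind of trade I am asking about for exploratory search.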
One of my ongoing research interest areas is in retrieval interfaces that allow more expressive and powerful statements of a user information need. In that spirit, I wrote a minor rant last April about how the Apple iTunes smart playlist creation interface sacrifices functionality in the interest of simplicity. One could only create smart playlists using a flat conjunction or flat disjunction of expressions. See this screenshot:
Well, the times they are a’changing. I just noticed that the newest version of iTunes (9.0) allows arbitrarily-nested conjunctions and disjunctions. This ability to mix and match gives rise to much greater capability, and only adds the minimum of interface clutter and complexity, i.e. expression indentation and an additional (…) button:
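To see what the change buys you, it helps to model a smart playlist as a rule tree rather than a flat list. A toy sketch (my own representation, not Apple's actual format): the old interface could only express a single top-level "all" or "any", while the new one lets them nest.

```python
def matches(rule, track):
    """Evaluate a nested rule tree against a track (a dict of attributes)."""
    op = rule[0]
    if op == "all":   # conjunction: every sub-rule must hold
        return all(matches(r, track) for r in rule[1:])
    if op == "any":   # disjunction: at least one sub-rule must hold
        return any(matches(r, track) for r in rule[1:])
    field, value = rule[1], rule[2]   # leaf: attribute equality
    return track.get(field) == value

# "(genre is Jazz AND rating is 5) OR artist is Miles Davis" --
# a nesting the old flat interface could not express.
rule = ("any",
        ("all", ("is", "genre", "Jazz"), ("is", "rating", 5)),
        ("is", "artist", "Miles Davis"))

print(matches(rule, {"genre": "Jazz", "rating": 5, "artist": "Bill Evans"}))  # True
print(matches(rule, {"genre": "Rock", "rating": 5, "artist": "Rush"}))        # False
```

The interface cost really is minimal, as indentation in the UI maps one-to-one onto depth in the tree.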
I laud the change and improvement, and I feel that it is another step in the ongoing attempt to raise consciousness about the value of moving beyond barren and crippled-functionality information organization interfaces. That is one of the core challenges of HCIR, and I see Apple now taking another step in this direction.
I’ve been playing around with some old TREC data over the past few days and completely by chance I came across this document. I find it interesting because storytelling is a good metaphor for what we as researchers do when we construct interactive information seeking systems. The document is short enough that I think I can reproduce it here in its entirety without getting into intellectual property trouble. I hope.
July 5, 1990, Thursday, Home Edition
Calendar; Part F; Page 1; Column 1; Calendar Desk
“Networks are run by people whose weakest suit is that they can’t understand the importance of the craft of storytelling, which is what film and television are all about. . . . They can do statistical things, but they can’t quantify storytelling and put it into a computer.”
Writer-producer Roy Huggins, in Television & Families magazine
Wikipedia’s take on Roy Huggins.
Last March I pointed out a short piece by Tessa Lau about how good interaction design trumps smart algorithms. Today I have a followup. In particular, Xavier Amatriain has a good writeup of the recently concluded Netflix contest. Some of the lessons learned by going through the process are related to the importance of good evaluation metrics, the effect of (lapsed) time, matrix factorization, algorithm combination, and the value of data.
Data is always important, but what struck me in the writeup was his discovery that the biggest advances came not from the accumulation of massive amounts of data, log files, clicks, etc. Rather, while dozens and dozens of researchers around the world were struggling to reach that coveted 10% improvement by eking out every last drop of value from large, data-only methods, Amatriain comparatively easily blew past that ceiling and hit 14%.
How? Continue reading…
Jeff Dalton recently wrote about why he doesn’t want your search log data. It is an interesting read, and I recommend going through the whole article and comments. But I want to call attention to one thought in particular:
Academia should be building solutions for tomorrow’s data, not yesterday’s.
What will the queries and documents look like in 5 or even 10 years and how can we improve retrieval for those? It’s not an easy question to answer, but you can watch Bruce Croft’s CIKM keynote
for some ideas…I still believe in empirical research. However, I’m also well-aware that over-reliance on limited data can lead to overfitting and incremental changes instead of ground-breaking research. To use an analogy from Wall Street, we become too focused on quarterly paper deadlines and lose sight of the fundamental science.
It is a provocative thought, and I find it compelling. By spending too much effort paying attention to yesterday’s — and even today’s — data, you wind up limiting yourself to the existing, visible gradient. At the same time, an open question is how one develops for tomorrow’s data when that data, by definition, does not yet exist. This is a question that I hope to address more in the upcoming months. Not answer, but address. Most likely by pointing to work by other researchers not directly working on the IR task (as I’ve done a bit in the past). Developing for tomorrow’s data is not easy, but neither should it be dismissed just because it lies beyond the needs of today’s users.
There’s no doubt that the information economy continues to create a lot of wealth, but I think it’s fair to ask if it’s also creating enough science to replenish the stock of scientific capital that it’s still burning through. I think it’s clear that chaotic, market-driven change is a good way to bring ideas quickly and efficiently from concept to profitable product. However, such a rapid churning of the institutional and cultural landscape ultimately may not be conducive to the kind of steady, expensive, long-term investment in fundamental research that produces the really big ideas that somewhere, at some completely unforeseeable point in the future, change the world.
I’ve added a couple of updates to my previous post about the “Google Discover Music” service that is launching today. See also Paul’s writeup.
But I have been reading Danny Sullivan’s liveblog of the release event, and I came across a quote that made me chuckle out loud:
Bill talking about how this will let people hear more diverse music. “They’re [Google Music is] going to do for music what they did for the web.”
Oh my goodness, I hope not! Because what they did for the web is put a popularity filter in front of their content-based search mechanism:
Google search works because it relies on the millions of individuals posting links on websites to help determine which other sites offer content of value. We assess the importance of every web page using more than 200 signals and a variety of techniques, including our patented PageRank™ algorithm, which analyzes which sites have been “voted” to be the best sources of information by other pages across the web. As the web gets bigger, this approach actually improves, as each new site is another point of information and another vote to be counted. In the same vein, we are active in open source software development, where innovation takes place through the collective effort of many programmers.
I do not want my music retrieval and discovery algorithms to be powered by millions of individuals posting (and clicking) links in order to help determine which musicians and songs offer content of value. I do not want my music search results to have been “voted” their way into my results list. I do not want such a music search service to get even bigger by counting even more points of information and votes.
If Google ends up doing to music what they did to the web, they will destroy music. Please let it not be so. As Brian Whitman, founder of The Echo Nest, recently said at a conference:
“If we only used collaborative filtering to discover music, the popular artists would eat the unknowns alive.”
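The feedback loop Whitman is describing is easy to caricature in code: if the recommender always surfaces whatever is currently most played, the tiniest initial head start compounds into total dominance. A deliberately extreme toy model (all numbers invented):

```python
# Ten artists; artist 0 starts with a tiny head start in play counts.
plays = [11] + [10] * 9

# A pure popularity filter: each round, recommend the currently
# most-played artist, and the listener dutifully plays it.
for _ in range(1000):
    top = plays.index(max(plays))
    plays[top] += 1

print(plays)  # [1011, 10, 10, 10, 10, 10, 10, 10, 10, 10]
```

Real collaborative filters are less brutal than this winner-take-all caricature, but the direction of the pressure is the same: popularity begets recommendation begets popularity.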
UPDATE: I just noticed something in this new Google Music service that I hadn’t noticed before: Popups! Check out this explanation video from the official Google blog, starting at 0:34 and going to 0:47. Compare and contrast that with the official Google position on popups on the Google site:
We do not allow pop-up ads of any kind on our site. We find them annoying.
But there is a solution! Google recommends the following:
If you are experiencing pop-ups generated by one of these malicious programs, you may want to remove the pop-up program from your computer.
TechCrunch is reporting a new Google Music service, purportedly to be released in about a week here in the U.S.:
Matt Ghering, a product marketing manager at Google, has been one of the people talking to the big four music labels about the new service, we’ve heard from one of our sources. And he has supposedly sent these screenshots of the look and feel of Google Music search to various rights holders and potential partners. The first screenshot shows how a search result might look on Google for a search for “U2.” A picture of the band is to the left of four streaming options for various songs, and the user has the option of listening via either iLike or LaLa. Click on one of the results, and a player pops up from the services that streams the song, along with an option to purchase the song for download.
I suppose the ability to find/stream a particular, known song is nice. But that is not what music search / music retrieval is about. Music retrieval is a fundamentally exploratory domain. When you are looking for music to accompany a photo slideshow, or music to create a playlist at a party, or music to DJ at a social dance event (e.g. salsa or waltz), or when you simply want to discover new and interesting bands, genres, etc., a known-item search is not very helpful. You have to already know exactly what song (or band) you want in order to ask for exactly that song (or band).
With exploratory search, on the other hand, you don’t know what you don’t know. When you want to find that perfect song for your photo slideshow, you may have never heard the song before, much less even heard of the artist that wrote/performed it. How are you going to navigate the space of all recorded music to find your song, if all you have is a single line text input box?
Simple user interfaces are nice, until they become so simple and focused that they are unusable for your information need. The metaphor that has been so successful for years in Web search does not apply to music search. We will have to wait until next week to see if the leaked screenshots are indeed what the service will look like. But if they turn out to be accurate, I have to seriously question the decision-making process that led to this conflation of the Web search user experience and the music search user experience. The goals are often so fundamentally different that I have a hard time understanding why the former got applied to the latter.
See also some of my previous posts:
UPDATE: Don’t forget that ISMIR 2009, held in Kobe, Japan, starts next week! http://ismir2009.ismir.net/
UPDATE 2: Looks like the new Google Music service is being released today 28 October 2009 (see here and here). Right in the middle of the ISMIR conference. Classy. Despite the fact that the name of the launch event is “Google Discover Music”, the screenshots make it abundantly clear that there is nothing being offered beyond basic, known-item song or artist lookup. There do not appear to be any real discovery tools, any exploratory interfaces. In fact, the thing that strikes me most about the reports that I am seeing is this line:
If you search for an artist or album name, the OneBox will include a set of four songs that are chosen algorithmically by the partner music site, not by Google. Each song will be linked to an audio clip that will play in a Flash-based pop-up window provided by the partner site. In some cases, the partner may provide one full play of the song before defaulting to a 30-second preview.
Not only is this lookup, rather than discovery, but by popping up a flash-based audio window, rather than taking the user to the partner music site, Google is going against its long-held creed of getting the user off of its properties as quickly as possible. Instead, this interface encourages users to linger on Google, rather than begin exploring music elsewhere. That does not seem very Googly to me.
UPDATE 3: If you read the liveblogs of the event (links above), you’ll see that Google is saying that right now they are not planning to commercialize this service. However, should that situation change, they are free to use my suggestion from April 2006 on how they might inject advertising into music: http://blogs.sun.com/plamere/entry/guaranteed_to_raise_a_smile
I have more questions than I have answers. One of the topics that I know very little about, and on which I often seek clarification and wisdom, is A/B testing in the context of rapid-iteration, rapid-deployment online systems. So I’d like to ask a question of my readership (all four of you).
Suppose Feature B tests significantly better than A. You therefore roll out B. Furthermore, suppose later on that Feature C tests significantly better than B. You again roll out C. Now, suppose you then do an A/C test, and find that your original Feature A tests significantly better than C.
What do you do? Do you Pick A again, even though that’s where you started? Roll back to B, because that beats A? Stick with C, because that beats B?
I’ve worked on enough low-level retrieval algorithms to have seen things like this. Improvement is not always monotonic, and pairwise wins are not always transitive. When you run into a loop like this, what do you do? Sure, it would be nice to come up with a Feature D that beats all of them. And in an offline system, with no real users depending on you minute-by-minute, you can take the research time to find D. But in the heat of the moment, in an online system where rapid iteration is a highly valued process, which of A, B, or C do you roll out to your end users?
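For what it’s worth, a cycle like this does not require a buggy experiment: pairwise majority preferences are simply not transitive in general (the Condorcet paradox). Three equal-sized user segments with different feature rankings are enough to produce B beating A, C beating B, and A beating C (segments and rankings invented for illustration):

```python
# Three equal-sized user segments, each ranking features best-first.
segments = [
    ["C", "B", "A"],
    ["A", "C", "B"],
    ["B", "A", "C"],
]

def majority_prefers(x, y):
    """True if more segments rank feature x above feature y."""
    votes = sum(1 if s.index(x) < s.index(y) else -1 for s in segments)
    return votes > 0

print(majority_prefers("B", "A"))  # True: B beats A, two segments to one
print(majority_prefers("C", "B"))  # True: C beats B, two segments to one
print(majority_prefers("A", "C"))  # True: A beats C -- the cycle closes
```

Each pairwise test here is perfectly honest; it is the aggregation across heterogeneous users that cycles. Which is one reason to be suspicious of rollout decisions made purely from a chain of pairwise wins.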