Comments on: Retrievability

By: jeremy

jeremy — Thu, 23 Apr 2009 18:31:25 +0000

No, you didn’t mention spam by name, but you did talk about non-original/duplicate content. And that’s often a feature of spam (e.g. splogs).

I don’t quite agree with you that my information need about finding cafes in Prague that are located inside passages and at the end of tiny, old town meandering lanes is “bias”. That’s just my information need. Some people use a search engine to find Gucci stores in Singapore. Some people use a search engine to find used book stores in Newton, Iowa. I want to be able to use a search engine to find off-the-beaten-path cafes in Prague. That’s just my info need.

If the search engine is fundamentally incapable of finding information that meets my need, because it has built into it an algorithm that makes the cafes that I am looking for “unretrievable”, then the search engine itself has bias. But I do not.

And that’s also the answer to your question about “Why is it so important that all documents are (easily) retrievable?” Because if those documents are relevant to my information needs, I need to be able to find them. If the search engine prevents me from finding them, because of its bias, then the search engine is flawed.

By: Otis Gospodnetic

Otis Gospodnetic — Thu, 23 Apr 2009 15:01:41 +0000

Yes, that’s what I said (not sure where the comments about spam and copies come from, as I didn’t mention that). I see the new post about cafes in Praha – see, you are looking for a bias there – you like hidden, less-popular cafes. So each document needs to have as many facets as possible/applicable and allow the searcher to turn various pieces on/off. In an ideal world. 🙂

By: jeremy

jeremy — Thu, 23 Apr 2009 05:13:12 +0000

In reply to jeremy. A shorter reply: I'm not saying that search engines shouldn't necessarily exhibit bias. I'm saying that we, the searchers, should be able to turn that bias on and off, when necessary. This is what you also say, above.. so I think we're fairly in agreement on that.

By: jeremy

jeremy — Thu, 23 Apr 2009 05:12:02 +0000

In reply to Otis Gospodnetic.

Assuming for a moment that we’re not talking about spam documents — which I of course agree should not be retrievable — I am of the mindset that everything is valuable to someone. Otherwise, why would someone have created that page in the first place? (BTW, I would call a document that copies another document a form of spam.)

Look at the experiments that were done in this paper: One was a collection of government documents from the web, the other was a collection of newswire articles. All easy jokes about the government information aside, I would say that all of that information is at least important to someone. So it needs to be retrievable. And same with the newswire articles. If it was newsworthy enough to be written about in the first place, then some future historian, trying to understand the early 21st century, is going to need to be able to retrieve it. If the systems we build forever condemn certain pages to non-retrievability, well.. we should at least be conscious of this fact.

In short, this is not just about duplicate content and spam. In fact, I’ll bet the collections that the paper above tested on have no spam, and very little duplicate content. And despite the “cleanness” of those collections, there were still large numbers of documents that were completely non-retrievable.

By: Otis Gospodnetic

Otis Gospodnetic — Thu, 23 Apr 2009 01:49:33 +0000

Interesting!
So, then, what would be nice is if search engines allowed you to turn certain factors, certain biases on/off (e.g. turn off what we know as PageRank when searching Google’s index).

But not all documents are made equal. Why is it so important that all documents are (easily) retrievable? Maybe some really don’t have unique-enough content that doesn’t already exist in other, “better” documents. Or are you saying that there are no better or worse documents because there should be no bias?

By: jeremy

jeremy — Wed, 15 Apr 2009 16:30:13 +0000

In reply to dinesh vadhia. Retrievability of an item means that there exists some non-trivial query for which the rank of that item in the results list is < c, where c is the number of items that your average user is typically willing to examine, e.g. 10 or 20 or maybe even 50, depending on the task. And by non-trivial, I mean that you are not using the item itself (or a long exact quotation from the item itself) as the query -- of course the item would come up top-ranked in that case, but that's beside the point because why would you ever search for that item, if you already have it? Because even if you rank all items, if a certain item never comes up in the top c for any reasonable query that a user might enter, then for all intents and purposes, that item is non-retrievable. For example, if that item never achieved a rank better than 453 for any query, then no one is ever going to find it, even if it has been ranked.
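
To make that definition concrete, here is a minimal sketch, not the method from the Azzopardi & Vinay paper. Everything in it is made up for illustration: the three toy documents, the term-frequency scorer (which happens to favor longer documents), and the choice of "all one- and two-term queries" as the set of reasonable, non-trivial queries. It treats an item as retrievable if it lands in the top c results for at least one of those queries.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy collection: two longer documents and one short one.
DOCS = {
    "guide": "cafe cafe cafe prague prague passage passage lane lane",
    "review": "cafe cafe prague prague old old town town",
    "hidden": "cafe passage lane",
}
TERM_COUNTS = {doc_id: Counter(text.split()) for doc_id, text in DOCS.items()}


def rank(query_terms):
    """Rank matching documents by raw term frequency (an assumed scorer).

    Ties are broken alphabetically by document id so results are deterministic.
    """
    scores = {
        doc_id: sum(counts[term] for term in query_terms)
        for doc_id, counts in TERM_COUNTS.items()
    }
    matching = [doc_id for doc_id, score in scores.items() if score > 0]
    return sorted(matching, key=lambda doc_id: (-scores[doc_id], doc_id))


def retrievable(queries, cutoff):
    """Return the documents that reach the top `cutoff` for at least one query."""
    found = set()
    for query in queries:
        found.update(rank(query)[:cutoff])
    return found


# Assumed stand-in for "any reasonable query": every one- and two-term query.
VOCAB = sorted({term for counts in TERM_COUNTS.values() for term in counts})
QUERIES = [(term,) for term in VOCAB] + list(combinations(VOCAB, 2))

for cutoff in (1, 2):
    missing = set(DOCS) - retrievable(QUERIES, cutoff)
    print(f"c={cutoff}: non-retrievable = {sorted(missing)}")
```

With c = 1 the short document never surfaces, because the longer "guide" document outscores it on every term they share; with c = 2 it becomes retrievable. Every document is ranked for some query in both cases, which is the distinction between "ranked" and "retrievable" drawn above.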

By: dinesh vadhia

dinesh vadhia — Wed, 15 Apr 2009 13:13:58 +0000

For a given query, we rank all items. The question then is: what does “retrievable” mean? If the item appears on page 10 of the results, has it been retrieved?

By: jeremy

jeremy — Thu, 09 Apr 2009 22:02:57 +0000

But the point of the Azzopardi & Vinay paper is to point out that there might not be items similar enough to any possible query to be able to rank in the top n for whatever query that you use. How does the similarity scoring that you use avoid that non-retrievability issue? How do you guarantee that every single item in your collection is retrievable (by at least one query other than the item itself, which is a degenerate case)? Are there more details or a white paper anywhere that you could share?

By: dinesh vadhia

dinesh vadhia — Thu, 09 Apr 2009 20:33:17 +0000

I’ve only skimmed your posts on this subject here and on the Noisy Channel and plan to read in detail later. However, this stood out: “In other words, it’s not just a matter of knowing or not knowing the right query term(s) to use, as Norvig says. Rather, no matter what query you use, a good document might perpetually be buried further than you have time or energy to examine! This is because the system might naturally favor longer documents, or high inlink pages, or have some other (not even necessarily intentional) bias that makes certain pages essentially non-retrievable.”

This is one of the areas that we address, i.e. finding items similar to the query items in “ranked” order.

Dinesh