In my previous post I talked a little about the notion that big data alone cannot solve many of our problems. I would like to give a more concrete example of this by discussing a paper published at CIKM 2008: “Retrievability: An Evaluation Measure for Higher Order Information Access Tasks” by Azzopardi and Vinay. In large part, my desire to discuss this paper comes from a few of Peter Norvig’s comments in the aforementioned thread on big data:
So the question comes down to: given that there will inevitably be this concentration in the head, do search engines and colaborative [sic] filtering services help you find things in the tail? i think they do, but you have to ask the right question. If you ask a very general question, you’ll get a very popular answer. If you ask a rare or specific question, you’ll get a result that is a weighted combination of the nearest neighbors, weighted by popularity. So the more data you have, the more oddball nearest neighbors you’ll have, but you have to put the query into a part of the space that is not already populated with very popular entries.
The trick is to have a better conversation with the user. If they want the head, a two-word query is sufficient conversation; if they want the tail it will take more. I showed how it can be done if the user takes the initiative;
If I understand correctly, Peter is saying that if a broad query does not produce results that sufficiently satisfy the user's information need, the user can, by adding more query terms, narrow the collection space and still rely on the micro-popularity within that narrower space to surface those results. Where I disagree is that even if the user adds more terms to the query, there is no guarantee that the user can reach the relevant, necessary region of the collection. This finally brings me to the aforementioned Retrievability paper:
With search playing an increasingly crucial role in the access to information, there are growing concerns over the role of this technology [8, 11]. This is because as larger amounts of information is being made available online, Information Retrieval (IR) systems, as exemplified by the many search engines available today, are becoming the primary means of accessing this information. Consequently, issues are being raised from a number of different areas questioning the influence that IR systems have on the access to information. For example, media regulators are concerned over whether search engines are biased towards particular web sites over others [16, 9], while e-Government administrators in the U.S. are now legally required to ensure that all government information online is accessible through search engines. Legal and patent searchers also need to ensure that they have IR systems which enable them to find all documents in the collection relevant to their information need. These situations require a way to evaluate an IR system in terms of how much access they provide into the underlying collection. To address these types of questions, we must develop new measures that indicate how easily documents or sets of documents in the collection can be accessed given the IR system. These measures can then be used to compare the influence of different IR systems on the access to information.
The goal of the paper is to provide a novel evaluation metric for recall-oriented, information-oriented searches, based on how easily retrievable documents are. This is not an evaluation of home-page navigation; it is an evaluation geared toward someone with a deeper information need. The system effectiveness measure that these researchers come up with is that of Retrievability:
The ability to find a document through the retrieval process is therefore a combination of whether or not the document is indexed, and then whether or not a document can be retrieved through querying. The first factor has already been well studied [14, 6], illustrating the importance of making content discoverable by search engine crawlers to ensure inclusion in the index. The second factor is more subtle and is less understood. This is because it is generally assumed that if a document is indexed, it can be retrieved; the only impeding factors being:
- a user’s ability to formulate his/her need in the form of a suitable query,
- the retrieval system’s matching function, and,
- the user’s willingness to examine documents.
However, the combination of these factors means that some documents are more easily accessible than others. This is because search is a process of discrimination: a user deciding how to pose the query, the IR system attempting to discriminate between relevant and non-relevant content, and the user wading through portions of the results retrieved by the system trying to find content relevant to his/her need. The main objective of the retrieval system is to maximize effectiveness, which implicitly leads to favoring certain types of documents over others; and so it can be said that the system is inherently biased. Applying or removing particular retrieval biases have been shown to significantly improve or degrade effectiveness. For example, using inlinks to favor more popular documents, or accounting for document length normalization, can lead to significant improvements to retrieval effectiveness. The dual nature of discrimination, e.g. “favor longer over shorter”, means that some documents can become more retrievable at the expense of others.
In other words, it’s not just a matter of knowing or not knowing the right query term(s) to use, as Norvig says. Rather, no matter what query you use, a good document might perpetually be buried further than you have time or energy to examine! This is because the system might naturally favor longer documents, or high inlink pages, or have some other (not even necessarily intentional) bias that makes certain pages essentially non-retrievable.
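As a toy illustration of this kind of bias (my own construction, not an example from the paper), consider how multiplying a content score by an inlink-based popularity prior can push the most topically relevant document out of the top spot. All names and numbers here are made up:

```python
import math

# Toy illustration of ranking bias (my own construction, not the
# paper's): an inlink-based boost can bury the best topical match.

docs = {
    "obscure_page": {"tf": 9, "inlinks": 1},     # best topical match
    "news_portal":  {"tf": 3, "inlinks": 2000},
    "popular_hub":  {"tf": 2, "inlinks": 5000},
}

def content_score(d):
    return d["tf"]  # stand-in for a BM25-style content score

def popularity_boosted(d):
    return d["tf"] * math.log(1 + d["inlinks"])  # inlink-biased variant

by_content = max(docs, key=lambda n: content_score(docs[n]))
by_boosted = max(docs, key=lambda n: popularity_boosted(docs[n]))
print(by_content)  # obscure_page
print(by_boosted)  # news_portal
```

No query-side refinement fixes this: unless the searcher can switch the boost off, the obscure page stays buried no matter how the query is rephrased.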
The experiment that Azzopardi and Vinay do to test this is to take all unigram terms v in the vocabulary V with df > 5, and all bigrams from VxV with df > 20, and use this entire set of unigrams and bigrams as queries. With so many different queries and query combinations, all documents should be retrieved by at least one of the queries, right? At least that’s the intuition that this metric is built on. Read the paper for the exact details, but they then run all these queries and make note of all the documents that never get retrieved within the top c documents of any query. They come up with a single coefficient (Gini) to characterize retrievability — the lower the value of this coefficient, the less bias there is in the retrieval algorithm and the more retrievable documents are.
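To make the setup concrete, here is a rough sketch of the retrievability computation and the Gini coefficient over the resulting scores. This is my own toy code, not the paper's; the scoring function and all names are illustrative stand-ins, and the real experiment runs millions of generated queries over full test collections:

```python
# Toy sketch of the retrievability experiment. The scoring function
# and all names are illustrative stand-ins, not the paper's setup.

def score(query_terms, doc_terms):
    """Naive term-frequency match score (stand-in for a real model)."""
    return sum(doc_terms.count(t) for t in query_terms)

def retrievability(docs, queries, c=10):
    """r(d): number of queries for which doc d ranks in the top c."""
    r = {i: 0 for i in range(len(docs))}
    for q in queries:
        ranked = sorted(range(len(docs)),
                        key=lambda i: score(q, docs[i]),
                        reverse=True)
        for i in ranked[:c]:
            if score(q, docs[i]) > 0:  # ignore zero-score "matches"
                r[i] += 1
    return r

def gini(values):
    """Gini coefficient: 0 = equal access for all docs, 1 = maximal bias."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if total == 0:
        return 0.0
    return sum((2 * (i + 1) - n - 1) * x for i, x in enumerate(xs)) / (n * total)

docs = [
    "the quick brown fox".split(),
    "the lazy dog sleeps".split(),
    "quick quick brown brown fox fox".split(),  # repetition wins every query
]
queries = [["quick"], ["brown"], ["fox"], ["quick", "brown"]]
r = retrievability(docs, queries, c=1)
print(r, round(gini(r.values()), 3))  # {0: 0, 1: 0, 2: 4} 0.667
```

With c=1, the repetitive third document tops every query and the other two are never retrieved at all, which the high Gini value reflects.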
The detail that I would like to point out is in Table 3 on page 6. On the GOV collection, the researchers compare the Retrievability of two algorithms: BM25 and BM25i, the latter being a variant of BM25 that takes inlink popularity into account. From the table, we see that whether the user is willing to look at 10, 20, 30, 50, or even 100 results per query, BM25 is consistently less biased than BM25i. Documents in a collection are more retrievable when BM25 is the gatekeeper than when the inlink-weighted BM25i is. Many more documents are never feasibly retrieved under BM25i than under BM25.
While I cannot extrapolate these results to any of the major web search engines, since the scale of their collections, links, and queries is much larger than the GOV collection's, it does seem to suggest that a popularity-driven, link-based retrieval algorithm might suffer from more bias, and therefore lower retrievability. It's clear that link-biasing is a good way to combat spam and help users navigate to authoritative home pages. What is less clear is how effective the technique is for information-oriented, recall-oriented needs. In fact, these results seem to suggest that popularity bias ("the inherent democratic nature of the web") might actually prevent some information from ever being seen, because it never appears at the top of anyone's result list. That's something to at least keep in mind.
I’ve only skimmed your posts on this subject here and on the Noisy Channel and plan to read in detail later. However, this stood out: “In other words, it’s not just a matter of knowing or not knowing the right query term(s) to use, as Norvig says. Rather, no matter what query you use, a good document might perpetually be buried further than you have time or energy to examine! This is because the system might naturally favor longer documents, or high inlink pages, or have some other (not even necessarily intentional) bias that makes certain pages essentially non-retrievable.”
This is one of the areas that we address, i.e., finding items similar to the query items in "ranked" order.
But the point of the Azzopardi & Vinay paper is that some items might not be similar enough to any plausible query to rank in the top n, no matter what query you use. How does your similarity scoring avoid that non-retrievability problem? How do you guarantee that every single item in your collection is retrievable (by at least one query other than the item itself, which is a degenerate case)? Are there more details or a white paper anywhere that you could share?
For a given query, we rank all items. The question then is: what does “retrievable” mean? If the item appears on page 10 of the results, has it been retrieved?
Retrievability of an item means that there exists some non-trivial query for which that item appears within the top c results, where c is the number of items your average user is typically willing to examine, e.g. 10 or 20 or maybe even 50, depending on the task.
And by non-trivial, I mean that you are not using the item itself (or a long exact quotation from the item itself) as the query — of course the item would come up top-ranked in that case, but that’s beside the point because why would you ever search for that item, if you already have it?
Because even if you rank all items, if a certain item never comes up in the top c for any reasonable query that a user might enter, then for all intents and purposes, that item is non-retrievable. For example, if that item never ranks better than 453rd for any query, then no one is ever going to find it, even though it has technically been ranked.
So, then, it would be nice if search engines allowed you to turn certain factors and biases on and off (e.g., turning off what we know as PageRank when searching Google's index).
But not all documents are created equal. Why is it so important that all documents be (easily) retrievable? Maybe some simply don't have unique-enough content beyond what already exists in other, "better" documents. Or are you saying that there are no better or worse documents, because there should be no bias?
Assuming for a moment that we’re not talking about spam documents — which I of course agree should not be retrievable — I am of the mindset that everything is valuable to someone. Otherwise, why would someone have created that page in the first place? (BTW, I would call a document that copies another document a form of spam.)
Look at the experiments that were done in this paper: one used a collection of government documents from the web, the other a collection of newswire articles. All easy jokes about government information aside, I would say that all of that information is important to at least someone. So it needs to be retrievable. The same goes for the newswire articles: if something was newsworthy enough to be written about in the first place, then some future historian, trying to understand the early 21st century, is going to need to retrieve it. If the systems we build forever condemn certain pages to non-retrievability, well… we should at least be conscious of that fact.
In short, this is not just about duplicate content and spam. In fact, I'll bet the collections that this paper tested on have no spam and very little duplicate content. And despite the "cleanness" of those collections, there were still large numbers of documents that were completely non-retrievable.
A shorter reply: I'm not saying that search engines shouldn't necessarily exhibit bias. I'm saying that we, the searchers, should be able to turn that bias on and off when necessary. This is what you also say above, so I think we're fairly in agreement on that.
Yes, that's what I said (not sure where the comments about spam and copies come from, as I didn't mention that). I see the new post about cafes in Praha – see, you are looking for a bias there: you like hidden, less-popular cafes. So each document needs to have as many facets as possible/applicable and allow the searcher to turn various pieces on/off. In an ideal world. 🙂
No, you didn’t mention spam by name, but you did talk about non-original/duplicate content. And that’s often a feature of spam (e.g. splogs).
I don't quite agree with you that my information need, finding cafes in Prague that are located inside passages and at the end of tiny, old-town meandering lanes, is "bias". That's just my information need. Some people use a search engine to find Gucci stores in Singapore. Some people use a search engine to find used book stores in Newton, Iowa. I want to be able to use a search engine to find off-the-beaten-path cafes in Prague. That's just my info need.
If the search engine is fundamentally incapable of finding information that meets my need, because it has built into it an algorithm that makes the cafes that I am looking for “unretrievable”, then the search engine itself has bias. But I do not.
And that’s also the answer to your question about “Why is it so important that all documents are (easily) retrievable?” Because if those documents are relevant to my information needs, I need to be able to find them. If the search engine prevents me from finding them, because of its bias, then the search engine is flawed.