In my previous post I talked a little about the notion that big data alone cannot solve many of our problems. I would like to give a more concrete example of this by discussing a paper published at CIKM 2008: “Retrievability: An Evaluation Measure for Higher Order Information Access Tasks” by Azzopardi and Vinay. In large part, my desire to discuss this paper comes from a few of Peter Norvig’s comments in the aforementioned thread on big data:
So the question comes down to: given that there will inevitably be this concentration in the head, do search engines and colaborative [sic] filtering services help you find things in the tail? i think they do, but you have to ask the right question. If you ask a very general question, you’ll get a very popular answer. If you ask a rare or specific question, you’ll get a result that is a weighted combination of the nearest neighbors, weighted by popularity. So the more data you have, the more oddball nearest neighbors you’ll have, but you have to put the query into a part of the space that is not already populated with very popular entries.
The trick is to have a better conversation with the user. If they want the head, a two-word query is sufficient conversation; if they want the tail it will take more. I showed how it can be done if the user takes the initiative;
If I understand correctly, Peter is saying that when a broad query does not produce results that sufficiently satisfy the user’s information need, the user can, by adding more query terms, narrow the collection space and still rely on the micro-popularity within that narrower space. Where I disagree is that even if the user adds more terms to the query, there is no guarantee of reaching the relevant, necessary region of the collection. This finally brings me to the aforementioned Retrievability paper:
With search playing an increasingly crucial role in the access to information, there are growing concerns over the role of this technology [8, 11]. This is because as larger amounts of information is being made available online, Information Retrieval (IR) systems, as exemplified by the many search engines available today, are becoming the primary means of accessing this information. Consequently, issues are being raised from a number of different areas questioning the influence that IR systems have on the access to information. For example, media regulators are concerned over whether search engines are biased towards particular web sites over others [16, 9], while e-Government administrators in the U.S. are now legally required to ensure that all government information online is accessible through search engines. Legal and patent searchers also need to ensure that they have IR systems which enable them to find all documents in the collection relevant to their information need. These situations require a way to evaluate an IR system in terms of how much access they provide into the underlying collection. To address these types of questions, we must develop new measures that indicate how easily documents or sets of documents in the collection can be accessed given the IR system. These measures can then be used to compare the influence of different IR systems on the access to information.
The goal of the paper is to provide a novel evaluation metric for recall-oriented, information-oriented searches, based on how easily retrievable documents are. This is not an evaluation of home page navigation; it is geared toward someone with a deeper information need. The system effectiveness measure that these researchers come up with is that of Retrievability:
The ability to find a document through the retrieval process is therefore a combination of whether or not the document is indexed, and then whether or not a document can be retrieved through querying. The first factor has already been well studied [14, 6], illustrating the importance of making content discoverable by search engine crawlers to ensure inclusion in the index. The second factor is more subtle and is less understood. This is because it is generally assumed that if a document is indexed, it can be retrieved; the only impeding factors being:
- a user’s ability to formulate his/her need in the form of a suitable query,
- the retrieval system’s matching function, and,
- the user’s willingness to examine documents.
However, the combination of these factors means that some documents are more easily accessible than others. This is because search is a process of discrimination: a user deciding how to pose the query, the IR system attempting to discriminate between relevant and non-relevant content, and the user wading through portions of the results retrieved by the system trying to find content relevant to his/her need. The main objective of the retrieval system is to maximize effectiveness, which implicitly leads to favoring certain types of documents over others; and so it can be said that the system is inherently biased. Applying or removing particular retrieval biases have been shown to significantly improve or degrade effectiveness. For example, using inlinks to favor more popular documents, or accounting for document length normalization, can lead to significant improvements to retrieval effectiveness. The dual nature of discrimination, e.g. “favor longer over shorter”, means that some documents can become more retrievable at the expense of others.
In other words, it’s not just a matter of knowing or not knowing the right query term(s) to use, as Norvig suggests. Rather, no matter what query you use, a good document might perpetually be buried deeper in the ranking than you have the time or energy to examine! This is because the system might naturally favor longer documents, or pages with many inlinks, or have some other (not even necessarily intentional) bias that makes certain pages essentially non-retrievable.
The experiment that Azzopardi and Vinay run to test this is to take all unigram terms v in the vocabulary V with df > 5, and all bigrams from V×V with df > 20, and use this entire set of unigrams and bigrams as queries. With so many different queries and query combinations, every document should be retrieved by at least one of them, right? At least that’s the intuition this metric is built on. Read the paper for the exact details, but they then run all these queries and note which documents are never retrieved within the top c results of any query. They summarize the outcome with a single number, the Gini coefficient of the retrievability scores: the lower the coefficient, the more evenly retrievability is spread across the collection, and the less biased the retrieval algorithm is.
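The idea is simple enough to sketch in a few lines. The toy corpus, the term-overlap scorer, and the cutoff below are all illustrative assumptions of mine, not the paper’s setup; only the shape of the computation (count how often each document appears in the top c across all queries, then take the Gini coefficient of those counts) follows the paper:

```python
from collections import Counter

# Hypothetical three-document corpus (purely illustrative).
docs = {
    "d1": "big data search engines index the web",
    "d2": "search engines rank documents by popularity",
    "d3": "obscure archival record of local history",
}

def search(query, docs):
    """Rank documents by simple term overlap with the query (a toy scorer,
    standing in for whatever retrieval model is being evaluated)."""
    scored = []
    for doc_id, text in docs.items():
        words = set(text.split())
        score = sum(1 for t in query.split() if t in words)
        if score > 0:
            scored.append((score, doc_id))
    return [d for _, d in sorted(scored, reverse=True)]

def retrievability(docs, queries, c):
    """r(d) = number of queries that retrieve d within the top c results."""
    r = Counter({d: 0 for d in docs})
    for q in queries:
        for doc_id in search(q, docs)[:c]:
            r[doc_id] += 1
    return r

def gini(values):
    """Gini coefficient of the scores: 0 means retrievability is spread
    perfectly evenly; values near 1 mean a few documents dominate."""
    vals = sorted(values)
    n, total = len(vals), sum(vals)
    if total == 0:
        return 0.0
    cum = sum((2 * (i + 1) - n - 1) * v for i, v in enumerate(vals))
    return cum / (n * total)

queries = ["search engines", "big data", "local history", "rank documents"]
r = retrievability(docs, queries, c=2)
```

In the real experiment the query set is the exhaustive unigram/bigram set described above, and a document with r(d) = 0 at cutoff c is exactly one of the “never feasibly retrieved” documents.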
The detail that I would like to point out is in Table 3 on page 6. On the GOV collection, the researchers compare the retrievability of two algorithms: BM25 and BM25i, the latter being a variant of BM25 that takes inlink popularity into account. From the table, we see that whether the user is willing to look at 10, 20, 30, 50, or even 100 results per query, the BM25 algorithm is consistently less biased than the BM25i algorithm. Documents in a collection that uses BM25 as a gatekeeper are more retrievable than when the inlink-popularity BM25i is the gatekeeper, and many more documents are never feasibly retrieved under BM25i than under BM25.
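The paper’s exact BM25i formulation isn’t reproduced here, but a tiny sketch can show why folding any popularity prior into a base relevance score tends to concentrate retrievability. Everything below (the equal base scores, the inlink counts, and the log prior) is an assumption of mine for illustration, not the paper’s formula:

```python
import math

# Hypothetical documents that are equally relevant to some query
# (equal base scores) but have very unequal inlink counts.
base_score = {"d1": 1.0, "d2": 1.0, "d3": 1.0}
inlinks = {"d1": 1000, "d2": 10, "d3": 0}

def popularity_biased(base_score, inlinks):
    """Fold a log-inlink prior into the base score (a generic choice
    for illustration, not the paper's BM25i)."""
    return {d: s * math.log(2 + inlinks[d]) for d, s in base_score.items()}

biased = popularity_biased(base_score, inlinks)
ranking = sorted(biased, key=biased.get, reverse=True)
# Documents indistinguishable on content now rank purely by popularity,
# so under a fixed rank cutoff c the unpopular ones can fall below the
# cutoff for every query that matches them.
```

This is the mechanism behind the Table 3 numbers: the prior doesn’t change which documents match a query, only which ones survive the cutoff, and it does so the same way for every query.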
While I cannot extrapolate these results to any of the major web search engines, since the scale of their collections, links, and queries is much larger than the GOV collection’s, they do suggest that a popularity-based, link-biased retrieval algorithm might suffer from more bias, and therefore lower overall retrievability. It’s clear that link-biasing is a good way to combat spam and help users navigate to authoritative home pages. What is less clear is how effective the technique is for information-oriented, recall-oriented needs. In fact, these results seem to suggest that popularity bias (“the inherent democratic nature of the web”) might actually prevent some information from ever being seen, because it never appears at the top of anyone’s result list. That’s something to at least keep in mind.