Tomorrow’s Data

Jeff Dalton recently wrote about why he doesn’t want your search log data.  It is an interesting read, and I recommend going through the whole article and comments.  But I want to call attention to one thought in particular:

Academia should be building solutions for tomorrow’s data, not yesterday’s. What will the queries and documents look like in 5 or even 10 years and how can we improve retrieval for those? It’s not an easy question to answer, but you can watch Bruce Croft’s CIKM keynote for some ideas…I still believe in empirical research. However, I’m also well-aware that over-reliance on limited data can lead to overfitting and incremental changes instead of ground-breaking research. To use an analogy from Wall Street, we become too focused on quarterly paper deadlines and lose sight of the fundamental science.

It is a provocative thought, and I find it compelling. By paying too much attention to yesterday’s, and even today’s, data, you wind up limiting yourself to the existing, visible gradient. At the same time, an open question is how one develops for tomorrow’s data when that data by definition does not yet exist. This is a question that I hope to address more in the upcoming months. Not answer, but address. Most likely by pointing to work by other researchers not directly working on the IR task (as I’ve done a bit in the past). Developing for tomorrow’s data is not an easy task, but it should not be dismissed just because it lies too far beyond the needs of today’s users.

There’s no doubt that the information economy continues to create a lot of wealth, but I think it’s fair to ask if it’s also creating enough science to replenish the stock of scientific capital that it’s still burning through. I think it’s clear that chaotic, market-driven change is a good way to bring ideas quickly and efficiently from concept to profitable product. However, such a rapid churning of the institutional and cultural landscape ultimately may not be conducive to the kind of steady, expensive, long-term investment in fundamental research that produces the really big ideas that somewhere, at some completely unforeseeable point in the future, change the world.
Also: “I, Cringely” from October 2002, entitled Eating our Seed Corn

7 Responses to Tomorrow’s Data

  1. Jeff says:

    I’m glad you liked it.

    One motivation for writing it was that I was sick of seeing recent work using only AP/WSJ newswire documents. We’ve come so far!

    I agree that developing “tomorrow’s data” is challenging. I think one way to approach the issue is to extrapolate properties and trends. For example, one trend I see is the rising importance of real-time social connectedness, evidenced by Facebook, Twitter, and cell phone messaging. Another trend is the use of location and geo-tagged data for augmented reality applications. We don’t have any collections or tasks that even attempt to address these. It’s possible to create them, at least on a small to medium-sized scale.

  2. jeremy says:

    Well, I think I would draw a distinction between “collections” and “data”. The collection (AP/WSJ/FBIS/CR/etc.) is only a subset of the overall data, which data also includes (1) the statements of user information need or the task that the user is trying to accomplish, and (2) the judgments of relevance to that information need, and (3) the evaluation criterion/metric that makes quantitative pronouncements of value on retrieval results, utilizing the aforementioned judgments. Those three things, taken together as a bundle, constitute the “data” of yesterday and today’s retrieval experiments.

    I have no problem with continuing to use WSJ articles. Oh, don’t get me wrong; I think we should look to other collections as well. I just think that old newspaper articles are, by themselves, not the problem. What makes the data “yesterday”, imho, is more the other three factors: task, judgment, and evaluation metric.

    We should be dreaming up/creating new tasks, doing basic research to explore the possible. Things that users currently are not doing, but that they might want to do in the future. Do we know how much and to what degree users will take up these new tasks? No. Does that matter? No. That’s why it’s basic research, exploring the limits of what is possible.

    We just can’t have the attitude of “We won’t build it until they come”. That’s yesterday-oriented.

  3. @jeremy
    Really pleased you’ve picked up on Jeff’s note which as you say is a very compelling thought. It’s been bouncing around in my head for a week or so. I don’t have much to add except this which is both sort of related and the funniest thing ever (or not!).

    I’m constantly looking for interesting, high-quality, publicly available data to build demos that showcase our technology. I heard about a really cool data set from a vendor (who shall definitely remain unnamed) who makes the data available for free, under license, to universities, so that eliminated us. Anyway, the license stipulated that the vendor had the right to license any technology developed with the data.

    I can understand Netflix mandating rights to the IP from the Netflix Prize winner but this is so absurd that I did fall over backwards!

  4. Pingback: Search geek weekly news update; Google social search leads the way | Search Engine Journal

  5. Jeff Dalton says:

    Agreed on the collection definition. Also agreed that we can use news data, but the task, judgments, etc. should be new.

    For example, one key issue is that most of these collections don’t address temporal aspects. They are static. I could see more interesting tasks on tracking and recommendation in dynamic news collections.

    To be relevant I also think a modern collection would include heterogeneous media: blogs, twitter updates, newswire content, user-generated content on Flickr and Youtube, and broadcast media.

  6. jeremy says:

    True. The more information that is contained within a document, the more tasks you can overlay upon a collection of such documents. But again, when considering the overall picture of “tomorrow’s data”, that is but one way to do it. We shouldn’t limit ourselves to thinking just that one way.

    And I understand the desire to work on heterogeneous media (I argued many years ago that modern collections need to be aware of music media on a web page, see here). But again, that’s just one path to relevance. There are all sorts of IR and information seeking techniques that still need to be developed, and those techniques can be researched independently of the data type. Think interactivity. Imho, the more relevant, bigger nut to crack is how to make an effective interactive IR system, rather than a non-interactive query-response system like we currently have. It is more interesting, I think, to have a really good interactive system, developed only on top of old newspaper articles, than a blog search engine that simply returns a static ranked list of blog posts.

    But that’s just my personal feeling/interest.

  7. jeremy says:

    @dinesh: I hear ya. I think we’re going to see more and more of these sorts of conflicts. The data by itself has value. The processes themselves (as processes) that manipulate that data have value. And the product of the processes has even more value. In that tangled mess of data and process, who owns what?

    I wrote about this a little a few months ago: “Data Liberation and Ownership”.
