Comments on: Tomorrow’s Data

By: jeremy

jeremy — Wed, 04 Nov 2009 00:22:12 +0000

@dinesh: I hear ya. I think we're going to see more and more of these sorts of conflicts. The data by itself has value. The processes themselves (as processes) that manipulate that data have value. And the product of the processes has even more value. In that tangled mess of data and process, who owns what? I wrote about this a little a few months ago: "Data Liberation and Ownership".

By: jeremy

jeremy — Tue, 03 Nov 2009 20:14:13 +0000

True. The more information that is contained within a document, the more tasks you can overlay upon a collection of such documents. But again, when considering the overall picture of "tomorrow's data", that is but one way to do it. We shouldn't limit ourselves to thinking just that one way. And I understand the desire to work on heterogeneous media (I argued many years ago that modern collections need to be aware of music media on a web page, see here). But again, that's just one path to relevance. There are all sorts of IR and information seeking techniques that still need to be developed, which techniques can be researched independently of the data type. Think interactivity. Imho, the more relevant, bigger nut to crack is how to make an effective interactive IR system, rather than a non-interactive query-response system like we currently have. It is more interesting, I think, to have a really good interactive system, developed only on top of old newspaper articles, than a blog search engine that simply returns a static ranked list of blog posts. But that's just my personal feeling/interest.

By: Jeff Dalton

Jeff Dalton — Tue, 03 Nov 2009 18:46:27 +0000

Agreed on the collection definition. Also agreed that we can use news data, but the task, judgments, etc. should be new and novel.

For example, one key issue is that most of these collections don’t address temporal aspects. They are static. I could see more interesting tasks on tracking and recommendation in dynamic news collections.

To be relevant I also think a modern collection would include heterogeneous media: blogs, twitter updates, newswire content, user-generated content on Flickr and Youtube, and broadcast media.

By: dinesh vadhia

dinesh vadhia — Tue, 03 Nov 2009 13:39:16 +0000

@jeremy
Really pleased you’ve picked up on Jeff’s note, which, as you say, is a very compelling thought. It’s been bouncing around in my head for a week or so. I don’t have much to add except this, which is both sort of related and the funniest thing ever (or not!).

I’m constantly looking for interesting, high-quality, publicly available data to build demos that showcase our technology. I heard about a really cool data set from a vendor (who shall definitely remain unnamed) who makes the data available for free under license to universities, so that eliminated us. Anyway, the license stipulated that the vendor had the right to license any technology developed with the data.

I can understand Netflix mandating rights to the IP from the Netflix Prize winner, but this is so absurd that I did fall over backwards!

By: jeremy

jeremy — Mon, 02 Nov 2009 17:51:34 +0000

In reply to Jeff.

Well, I think I would draw a distinction between “collections” and “data”. The collection (AP/WSJ/FBIS/CR/etc.) is only a subset of the overall data, which data also includes (1) the statements of user information need or the task that the user is trying to accomplish, (2) the judgments of relevance to that information need, and (3) the evaluation criterion/metric that makes quantitative pronouncements of value on retrieval results, utilizing the aforementioned judgments. Those three things, taken together as a bundle, constitute the “data” of yesterday’s and today’s retrieval experiments.

I have no problem with continuing to use WSJ articles. Oh, don’t get me wrong: I think we should look to other collections as well. I just think that old newspaper articles are, by themselves, not the problem. What makes the data “yesterday”, imho, is more the other three factors: task, judgment, and evaluation metric.

We should be dreaming up/creating new tasks, doing basic research to explore the possible. Things that users currently are not doing, but that they might want to do in the future. Do we know how much and to what degree users will take up these new tasks? No. Does that matter? No. That’s why it’s basic research, exploring the limits of what is possible.

We just can’t have the attitude of “We won’t build it until they come”. That’s yesterday-oriented.

By: Jeff

Jeff — Mon, 02 Nov 2009 15:29:43 +0000

I’m glad you liked it.

One motivation for writing it was that I was sick of seeing recent work using only AP/WSJ newswire documents. We’ve come so far!

I agree that developing “tomorrow’s data” is challenging. I think one way to approach the issues is to extrapolate properties and trends. For example, one trend I see is the rising importance of real-time social connectedness, evidenced by Facebook, Twitter, and cell phone messaging. Another trend is the use of location and geo-tagged data for augmented reality applications. We don’t have any collections or tasks that even attempt to address these. It’s possible to create them, at least on a small to medium-sized scale.