Google researchers Alon Halevy, Peter Norvig, and Fernando Pereira have an article in IEEE Computer magazine entitled “The Unreasonable Effectiveness of Data”. The article continues a theme that has been running strong within Google circles for the past half decade: training a simple algorithm with larger amounts of data is more effective than having a smart algorithm that tries to generalize or draw inferences from smaller amounts of data:
The first lesson of Web-scale learning is to use available large-scale data rather than hoping for annotated data that isn’t available. For instance, we find that useful semantic relationships can be automatically learned from the statistics of search queries and the corresponding results or from the accumulated evidence of Web-based text patterns and formatted tables, in both cases without needing any manually annotated data.
Instead of assuming that general patterns are more effective than memorizing specific phrases, today’s translation models introduce general rules only when they improve translation over just memorizing particular phrases (for instance, in rules for dates and numbers). Similar observations have been made in every other application of machine learning to Web data: simple n-gram models or linear classifiers based on millions of specific features perform better than elaborate models that try to discover general rules.
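To make the contrast concrete, here is a toy sketch of my own (not from the article) of the memorization end of that spectrum: a bigram “model” that is nothing but counted data, predicting the next word by lookup rather than by any general rule. The corpus is made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count bigram frequencies: the 'model' is just the memorized data."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Predict the most frequently observed follower; no general rules,
    and no prediction at all for words never seen in the data."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]

corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "a dog chased the cat",
]
model = train_bigram(corpus)
print(predict_next(model, "the"))    # "cat", the most common follower
print(predict_next(model, "zebra"))  # None: never memorized, no answer
```

With millions of features and Web-scale counts, this lookup-table approach is exactly what the authors report working so well; its obvious weakness, returning nothing for unseen inputs, is also exactly the “long tail” worry raised below.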
It’s a smart approach, and in general I laud it. For a certain class of problems, it is undoubtedly the correct approach to take. My question, however, is how big that class really is. How many problems are there for which we have enough data? My guess is that there are relatively few. I foresee only a small “head” of problems for which big data is available, but a “fat belly” and “long tail” of important problems that will never have the amount of data necessary to take this approach. The authors of the paper even say:
In many cases there appears to be a threshold of sufficient data. For example, James Hays and Alexei A. Efros addressed the task of scene completion: removing an unwanted, unsightly automobile or ex-spouse from a photograph and filling in the background with pixels taken from a large corpus of other photos. With a corpus of thousands of photos, the results were poor. But once they accumulated millions of photos, the same algorithm performed quite well.
So if someone has a picture of herself and her ex-spouse standing in front of their old house in Wilford, Idaho…or better yet in the kitchen of that aforesaid house, Google is telling me that I need millions of pictures of that scene before I can reliably use their algorithms? Ain’t never gonna happen. Any way you cut it, that is unacceptable. It limits the usefulness of this approach to a very small class of problems: that lone picture of you and the ex-spouse in front of the Eiffel Tower. And that’s it. You are sorely out of luck for the hundreds of other photos of you and your ex-spouse. If you’re so concerned about removing your ex-spouse from that one picture, a better use of resources might be to simply take another trip to Paris yourself and snap another picture.
The authors conclude:
So, follow the data. Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data. Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail. For natural language applications, trust that human language has already evolved words for the important concepts. See how far you can go by tying together the words that are already there, rather than by inventing new concepts with clusters of words. Now go out and gather some data, and see what it can do.
By all means, this avenue of research needs to continue. But my feeling is that there is also no need to give up on intelligently constructed, small-data, fancy algorithms any time soon. The world is full of a fat belly and long tail of problems for which we will never have enough “head” data for the simple-algorithm approach to be useful. I may be wrong, but I simply cannot imagine a world in which we will ever have millions of photos of Fred and Ina’s kitchen in Wilford, Idaho. Especially pictures from before they remodeled and got rid of those pink Formica countertops.
Update 1: Wow, I thought I had a few problems with this paper. Stefano Mazzocchi really did not like it (via Daniel T.).
Update 2: Peter Norvig comments here, and writes a longer explanation here.
It has been some time since I read the Hays and Efros paper, but as I recall, it does not work the way you described. It is more like “we need millions of photographs so that a few dozen happen to be similar to the one you want to inpaint, and then we choose the one that fits best.” It might still fail in the particular example you’ve given, but it does not require millions of photographs taken in the same setting. Of course, there are plenty of domains where it is pretty difficult to gather thousands of images, let alone millions (some medical imaging modalities come to mind).
Carlos, you’re right, I mis-stated the core of the paper when I wrote: “So if someone has a picture of herself and her ex-spouse standing in front of their old house in Wilford, Idaho…or better yet in the kitchen of that aforesaid house, Google is telling me that I need millions of pictures of that scene before I can reliably use their algorithms?”
What I should have said was: “So if someone has a picture of herself and her ex-spouse standing in front of their old house in Wilford, Idaho…or better yet in the kitchen of that aforesaid house, Google is telling me that I need millions of pictures, because only after amassing that many pictures will I have a chance of finding those one or two pictures taken at exactly the correct angle, with the correct exposure, and at the correct time of day (afternoon sunlight vs. nighttime tungsten) so that I can fill in the correct background. What I do not believe, though, is that even after collecting billions of pictures, you will have exactly the right one to fill in the background. So instead, I think it would be better to have an algorithm that let you explicitly specify and upload 2-3 slightly-wrong pictures of the scene and use fancy algorithms to distort, skew, alter white balance, and otherwise reconstruct occluded objects, so as to produce the one ‘correct’ background.”
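For concreteness, here is a deliberately tiny caricature of that retrieval step, entirely my own construction: rank a corpus of scene descriptors by distance to the query photo’s descriptor and keep the nearest few. The 3-dimensional “descriptors” and image names are made up; the real system uses far richer scene features. The point is only that the quality of the best match is a function of corpus size.

```python
import math

def nearest_scenes(query, corpus, k=3):
    """Rank corpus entries by Euclidean distance of their descriptor
    to the query descriptor. With millions of entries, a few are
    likely to land close; with thousands, even the best match may
    be far off (the 'threshold of sufficient data')."""
    scored = sorted(corpus, key=lambda item: math.dist(query, item[1]))
    return [name for name, _ in scored[:k]]

# Hypothetical 3-D scene descriptors standing in for real features.
corpus = [
    ("beach_001",   [0.90, 0.10, 0.20]),
    ("kitchen_042", [0.20, 0.80, 0.70]),
    ("street_913",  [0.50, 0.50, 0.10]),
    ("kitchen_107", [0.25, 0.75, 0.65]),
]
query = [0.22, 0.78, 0.68]  # descriptor of the photo being inpainted
print(nearest_scenes(query, corpus, k=2))  # the two nearest kitchens
```

Nothing in this ranking guarantees that the nearest kitchen is Fred and Ina’s kitchen, which is exactly the objection: retrieval finds a plausible background, not the true one.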
So I was wrong to have stated it the first way. Apologies.
The point that I was trying to make, however, remains the same: Unless you are talking about generic or popular backgrounds, such as the ocean or the Eiffel tower, it doesn’t matter how many millions or billions of photographs you have assembled. If you don’t have a picture of Fred and Ina’s kitchen taken from the right angle at the right time of day, you never will.
I made a similar comment here. In case the link goes down, here is a reproduction of that comment:
Here’s the problem I have with the method: It works really well if your task is to fill in image spaces with generic, though semantically consistent, backgrounds.
What if, however, the background should be filled with something that you know really is supposed to be there? For example, suppose you have a picture of Fred and Ina, taken in their kitchen in Wilford, Idaho. You want to remove Fred from the picture. But behind Fred is that vase that they picked up on their trip to France in the late 1970s. And just next to the vase is the old clock handed down through the generations from Ina’s family, from ancestors who used to be clockmakers in Switzerland. And below the clock is the old WWI photo of Ina’s great-uncle.
Now, you want to remove Fred from the picture, but not replace him with any old generic kitchen kitsch shelving. You want to replace it with what is really there in the picture.
How does big data help? I don’t really think it does.
With smart algorithms, on the other hand, you could get the user to provide the algorithm with just 2-3 pictures of the same scene, taken from different angles, and then have the algorithm reconstruct what really is behind Fred.
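For what it’s worth, here is a minimal sketch of that idea, under the simplifying (and unrealistic) assumption that the 2-3 user photos have already been aligned to the target image; a real system would first have to warp, skew, and color-correct them as described above. Masked pixels are filled with the per-pixel median of the alternate views, a crude consensus about what is really behind the removed object. Images are represented as plain 2-D grids of intensities for illustration.

```python
from statistics import median

def fill_occlusion(target, mask, aligned_views):
    """Replace masked pixels in `target` with the per-pixel median of
    the aligned alternate views, so the fill comes from photos of the
    actual scene rather than from a corpus of strangers' kitchens."""
    filled = [row[:] for row in target]
    for y, row in enumerate(mask):
        for x, occluded in enumerate(row):
            if occluded:
                filled[y][x] = median(view[y][x] for view in aligned_views)
    return filled

target = [[10, 10, 10],
          [10, 99, 10],   # 99 = Fred, the pixel to be removed
          [10, 10, 10]]
mask = [[0, 0, 0],
        [0, 1, 0],       # mask marks where Fred stood
        [0, 0, 0]]
# Three pre-aligned photos of the same scene without Fred in the way.
views = [[[10, 10, 10], [10, 42, 10], [10, 10, 10]],
         [[10, 10, 10], [10, 44, 10], [10, 10, 10]],
         [[10, 10, 10], [10, 43, 10], [10, 10, 10]]]
print(fill_occlusion(target, mask, views)[1][1])  # median of 42, 44, 43
```

The median keeps the fill robust to one bad view; only a handful of photos of the right scene are needed, not millions of photos of other scenes.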
So that’s my only point. Simple-method-big-data works great if you simply want a generic, albeit semantically meaningful, fill-in. But if you’re really trying to replace what is behind the removed element, I don’t see how big data helps you. As I often argue, it comes down to the task you are trying to solve.
Does this clarify things a bit?