Google researchers Alon Halevy, Peter Norvig, and Fernando Pereira have an article in IEEE Intelligent Systems entitled “The Unreasonable Effectiveness of Data”. The article continues a theme that has been running strong within Google circles for the past half decade, about how training a simple algorithm on larger amounts of data is more effective than having a smart algorithm that tries to generalize or draw inferences from smaller amounts of data:
The first lesson of Web-scale learning is to use available large-scale data rather than hoping for annotated data that isn’t available. For instance, we find that useful semantic relationships can be automatically learned from the statistics of search queries and the corresponding results or from the accumulated evidence of Web-based text patterns and formatted tables, in both cases without needing any manually annotated data.
Instead of assuming that general patterns are more effective than memorizing specific phrases, today’s translation models introduce general rules only when they improve translation over just memorizing particular phrases (for instance, in rules for dates and numbers). Similar observations have been made in every other application of machine learning to Web data: simple n-gram models or linear classifiers based on millions of specific features perform better than elaborate models that try to discover general rules.
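To make the “memorize first, generalize only when it pays” idea concrete, here is a toy sketch of my own (nothing from the article itself; the phrase table and the digit rule are invented purely for illustration):

```python
import re

# Hypothetical memorized phrase pairs -- in a real system these would be
# mined from millions of aligned sentences, not written by hand.
PHRASE_TABLE = {
    "good morning": "bonjour",
    "thank you very much": "merci beaucoup",
}

def translate_phrase(phrase: str) -> str:
    """Prefer a memorized specific phrase; fall back to a general rule."""
    if phrase in PHRASE_TABLE:
        return PHRASE_TABLE[phrase]     # memorization wins when it can
    if re.fullmatch(r"[0-9]+", phrase):
        return phrase                   # general rule: numbers pass through
    return phrase                       # no knowledge either way

print(translate_phrase("good morning"))  # -> bonjour
print(translate_phrase("42"))            # -> 42
```

The only point of the toy is the ordering: the table of specific phrases is consulted first, and the general rule exists solely for cases (dates, numbers) that no finite table could ever cover.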
It’s a smart approach, and in general I laud it. For a certain class of problems, it is undoubtedly the correct approach to take. My question, however, is how big that class really is. How many problems are there for which we have enough data? My guess is that there are relatively few. I foresee only a small “head” of problems for which big data is available, but a “fat belly” and “long tail” of important problems that will never have the amount of data necessary to take this approach. The authors of the article even say:
In many cases there appears to be a threshold of sufficient data. For example, James Hays and Alexei A. Efros addressed the task of scene completion: removing an unwanted, unsightly automobile or ex-spouse from a photograph and filling in the background with pixels taken from a large corpus of other photos. With a corpus of thousands of photos, the results were poor. But once they accumulated millions of photos, the same algorithm performed quite well.
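The machinery behind that scene-completion result is essentially nearest-neighbor search: describe the query photo with a compact scene descriptor, find the most similar photos in the corpus, and borrow background pixels from them. Here is a rough sketch of that retrieval step (the real Hays-Efros system uses GIST descriptors and careful compositing; my crude color histogram below is just a stand-in):

```python
import numpy as np

def descriptor(image: np.ndarray, bins: int = 8) -> np.ndarray:
    """Cheap stand-in scene signature: normalized per-channel color histogram."""
    hists = [np.histogram(image[..., c], bins=bins, range=(0, 255))[0]
             for c in range(3)]
    h = np.concatenate(hists).astype(float)
    return h / h.sum()

def nearest_scenes(query: np.ndarray, corpus: list, k: int = 5):
    """Indices of the k corpus photos most similar to the query.

    The whole trick only works when the corpus is so large that some
    photo happens to contain a plausible replacement background --
    hence thousands of photos fail where millions succeed.
    """
    q = descriptor(query)
    dists = [np.linalg.norm(q - descriptor(img)) for img in corpus]
    return np.argsort(dists)[:k]
```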
So if someone has a picture of herself and her ex-spouse standing in front of their old house in Wilford, Idaho…or better yet in the kitchen of that house, Google is telling me that she needs millions of pictures of that scene before the algorithm will reliably work? Ain’t never gonna happen. Any way you cut it, that is unacceptable, and it limits the usefulness of this approach to a very small class of problems: that lone picture of you and the ex-spouse in front of the Eiffel Tower, a scene of which millions of other photos really do exist. And that’s it. You are sorely out of luck for the hundreds of other photos of you and your ex-spouse. If you’re that concerned about removing the ex from that one picture, a better use of resources might be to simply take another trip to Paris yourself and snap another picture.
The authors conclude:
So, follow the data. Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data. Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail. For natural language applications, trust that human language has already evolved words for the important concepts. See how far you can go by tying together the words that are already there, rather than by inventing new concepts with clusters of words. Now go out and gather some data, and see what it can do.
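The parametric-versus-nonparametric advice is easy to see in a toy example of my own devising (not from the article): fit the same bimodal data with a single-Gaussian summary and with a kernel density estimate that keeps every point around.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two clusters: a one-Gaussian parametric summary cannot represent both.
data = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(3, 1.0, 500)])

def parametric_density(x: float) -> float:
    """Parametric: the data is reduced to two numbers, mean and std."""
    mu, sigma = data.mean(), data.std()
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def nonparametric_density(x: float, h: float = 0.3) -> float:
    """Nonparametric: a kernel density estimate keeps all 1000 points."""
    z = (x - data) / h
    return np.exp(-0.5 * z ** 2).sum() / (len(data) * h * np.sqrt(2 * np.pi))

# Between the two clusters the Gaussian summary wrongly reports high
# density; the KDE, which "holds a lot of detail", reports very little.
print(parametric_density(0.5), nonparametric_density(0.5))
```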
By all means, this avenue of research needs to continue. But my feeling is that there is also no need to give up on intelligently constructed, small-data, fancy algorithms any time soon. The world is full of a fat belly and long tail of problems for which we will never have enough “head” data for the simple-algorithm approach to be useful. I may be wrong, but I simply cannot imagine a world in which we will ever have millions of photos of Fred and Ina’s kitchen in Wilford, Idaho. Especially pictures from before they remodeled and got rid of those pink Formica countertops.