General – Information Retrieval Gupf

They Won People Over By A Logical Argument

jeremy — Fri, 10 Jun 2011 10:41:37 +0000

Via @glinden, I enjoyed this article on why GDrive (an early cloud document/file store) was never launched by Google:

At the time [2008], Google was about to launch a project it had been developing for more than a year, a free cloud-based storage service called GDrive. But Sundar [Pichai] had concluded that it was an artifact of the style of computing that Google was about to usher out the door. He went to Bradley Horowitz, the executive in charge of the project, and said, “I don’t think we need GDrive anymore.” Horowitz asked why not. “Files are so 1990,” said Pichai. “I don’t think we need files anymore.”

Pichai apparently went on to explain in more detail why files are no longer needed. It has to do with the notion that, in the cloud, you just have data and information. Organizing that information into files is not necessary, especially when you can just start editing that information directly in Google Docs. I’m going to ignore for a moment the “don’t be evil” ramifications for data portability, and the lock-in that comes through the dissolution of explicit files — how am I supposed to export my data into the Microsoft Cloud Word or into Open Office or into VisiWord or whatever else I’d like to use, if files do not exist? Instead, I’m going to focus on how this decision was arrived at:

When Pichai first proposed this concept to Google’s top executives at a GPS—no files!—the reaction was, he says, “skeptical.” [Linus] Upson had another characterization: “It was a withering assault.” But eventually they won people over by a logical argument—that it could be done, that it was the cloudlike thing to do, that it was the Google thing to do. That was the end of GDrive: shuttered as a relic of antiquated thinking even before Google released it. The engineers working on it went to the Chrome team.

This is what I find absolutely fascinating. Here is a company that A/B tests everything in a heavily data-driven manner, down to which of 41 shades of blue the link anchortext should be. So you would think that such a momentous decision about killing the whole GDrive project would be data-driven. It was not. I quote again:

But eventually they won people over by a logical argument—that it could be done, that it was the cloudlike thing to do, that it was the Google thing to do.

Here is an instance where an important decision about a potentially very large service was made not by the data, but by a HiPPO, the highest-paid person in the room. Granted, that HiPPO did not just come out and declare his or her omnipotent will. Reason and logical argumentation were still needed. But reason and logical argumentation were all that was needed. Nobody had to go out and “prove the idea with code”, as Silicon Valley loves to say. Code was written for GDrive, but the code itself did not provide the proof of its own non-release. And the all-powerful Big Data didn’t even begin to enter into the equation. What provided the proof was a core logical argument, coupled with a strong vision for the future (“it was the cloudlike thing to do”) and an ounce of emotional appeal (“that it was the Google thing to do”).

This is a very refreshing story and I am heartened and encouraged by it. The reason this is exciting is that much of the research that I work on, such as iterative relevance feedback and explicit collaboration, is work that does not have an immediate outlet in the consumer search world. It might take years before the average user is ready to engage with some of these tools and techniques, rather than the typical five-month lifecycle of your average prove-with-code, throw-it-against-the-wall-see-if-it-sticks data-driven feature release. Furthermore, it takes much longer to develop some of this research, as it is more risky and exploratory, and the market might not be ready for it for a long time. At the same time, however, if one waits to start developing such technologies until the market is actually ready, then it is already too late.

For example, the common wisdom for over a decade was that users were too lazy or too unwilling to provide explicit relevance judgments on the information or documents with which they are interacting. So none of these tools were developed. Then, all of a sudden, the Facebook “Like” button took off, and pretty soon the “+1” button was added, in complete contradiction and defiance of ten years of “prove it with the data” arguments about users being unwilling to explicitly mark the relevance of their information.

The way around this problem is to be willing to let a HiPPO make a decision — based on logical argument rather than on log data or usage data — thereby clearing the organization to move forward with that decision. Start working on tools for explicit judgment years ahead of time, and you will be ready with fantastic solutions once the marketplace catches up. Are all such HiPPO decisions going to be correct? Of course not. But will fewer opportunities be missed if you are willing to use logical argumentation to carve out a bold new vision for the future? Yes.

Don’t get me wrong; data-driven decision making is very useful. But it is useful for incremental improvements. If you want to take big leaps forward, such as the leap Google wanted to take in 2006 with its vision of the cloud, that requires a HiPPO being able to win people over — or being won over — by logical argument.

More on Simplicity and the Paradox of Choice

jeremy — Wed, 23 Jun 2010 16:12:48 +0000

I came across an interesting blogpost today, entitled “The Paradox of Choice is Not Robust”. To requote their quote:

Benjamin Scheibehenne, a psychologist at the University of Basel, was thinking along these lines when he decided (with Peter Todd and, later, Rainer Greifeneder) to design a range of experiments to figure out when choice demotivates, and when it does not.

But a curious thing happened almost immediately. They began by trying to replicate some classic experiments – such as the jam study, and a similar one with luxury chocolates. They couldn’t find any sign of the “choice is bad” effect. Neither the original Lepper-Iyengar experiments nor the new study appears to be at fault: the results are just different and we don’t know why.

After designing 10 different experiments in which participants were asked to make a choice, and finding very little evidence that variety caused any problems, Scheibehenne and his colleagues tried to assemble all the studies, published and unpublished, of the effect.

The average of all these studies suggests that offering lots of extra choices seems to make no important difference either way.

I’ll let that speak for itself, and will note only a few of my related blog posts from a year+ ago: Google Search Options and the Paradox of Choice and Ranked Lists and the Paradox of Choice.

Simplicity: Sparsity or Storytelling?

jeremy — Thu, 10 Jun 2010 17:39:00 +0000

A tweet by @akumar prompted me to punch up this quick blogpost:

as with all controversial issues, there’s a positive in google trying bing/image – that they’re not afraid to learn from competition

What Amit is referring to is the recent addition of gorgeous photographic images as search page background. See for example this writeup: http://blogs.abcnews.com/theworldnewser/2010/06/google-vs-bing-copycat-picture-on-prominent-page.html

He is of course correct; Google is learning from the competition. But there is another issue at play here, one that I don’t want to overlook because I feel it is very important. It is the issue of simplicity. What is simplicity? How is it defined? How is it measured? Conversely, what is complexity? What is clutter?

For over a decade now, Google has essentially defined simplicity as sparsity. Sparse backgrounds, lots of negative space, sparse color schemes, sparse auxiliary information (e.g. query term suggestions on the SERP page have only started appearing in the last year or two, despite the fact that such features existed 15 years ago in search engines of old such as Infoseek and Altavista). The reason given was that people didn’t like clutter, that people like simplicity. And in Google’s definition, simplicity equals sparsity.

I agree. People do like simplicity. I don’t question the veracity of that general sentiment. What has always bothered me, though, is the conflation of simplicity with sparsity. I think a much better definition of simplicity is not the amount of information or colors or negative space on a page, but the story that a design, interface, interaction, or algorithm tells. Something with a lot of colors and links and words can still be simple…if it tells a clear story! Conversely, something with fewer colors and links (sparser) can be more complex, if the story that it communicates is muddy and not as purposely focused.

This brings us to the Bing background image. In my opinion, even though the inclusion of a background image is less sparse and more “cluttered” (more colors, more shapes, more textures), it actually assists in the telling of a clearer story. Why? Because it more cleanly separates foreground and background, subject and frame. It provides compositional balance to the page. The white query input box on white background (10+ years of Google design) is sparser, but the story that it tells is less clear because foreground and background are not as cleanly separated. A white query input box on a richly colored and textured background tells a clearer, simpler story because the background image frames and separates the foreground query input box. Furthermore, because you can now distinguish background and foreground, you can more clearly see that the query input box lies near the pleasing “rule of thirds” line, which aids further in the overall storytelling.

In short, I applaud this move by Google, just as I applaud it from Bing. I never liked the white-on-white, because sparsity is not the same thing as simplicity. Simplicity arises through good storytelling, not through minimalism. No A/B testing will tell you this, though. It’s a definitional issue that must be defined before you start your A/B tests. Google has learned from the competition, as @akumar says. But I hope that the lesson Google has learned is not just that users like pretty pictures. I hope the lesson is that, when it comes to simplicity, there is a difference between sparsity and storytelling.

See also my posts: The Tyranny of Simplicity, The Tyranny of Simplicity, Redux, and The Craft of Storytelling. I also found this older discussion on Google’s Lively to be a fascinating read. In my understanding, the issue of “necessary complexity” that the author of that post hammers home is related to the issue of storytelling. Too much sparsity (of interaction in Lively’s case) leads to an inability to tell a clear story. Simplicity is storytelling, not sparsity.

Search in Social Media

jeremy — Fri, 29 Jan 2010 16:22:41 +0000

What is Social Search as opposed to Social Media? Social Search in Media? Search in Social Media?

Next week, Gene Golovchinsky and I are moderating a pair of panels at the SSM workshop. So we spent some time this week asking ourselves these definitional questions in preparation for the panel. We came up with a lightweight taxonomy, and have done a few classifications/examples of existing systems into that taxonomy. Whether or not you are one of the 80 participants in the workshop, I would invite you to take a look at our framework and comment or critique where necessary. Here’s the link to Gene’s writeup:

We think the phrase ‘search in social media’ has been used to refer to both the information being searched, and to the process for doing so. The information is standard user-generated content — tweets, blog posts, comment threads, tags, etc. The process, however, seems less well understood…It will be interesting to see how these ideas will be transformed by the discussion at the workshop. In any case, having a language with which to talk about phenomena is a prerequisite to articulating a research agenda, particularly in a young and multi-disciplinary field.

Please note, however, that one topic that will probably not be covered is the difference between social search (process) and collaborative search (process). The latter workshop will be held a few days later at CSCW. For an interesting thread on the distinction between the two, please see another FXPAL post from March of last year.

Google and the Meaning of Open

jeremy — Tue, 22 Dec 2009 11:44:43 +0000

There is a fantastic Google blog post today by Jonathan Rosenberg on the meaning (and value) of openness. Whooo-boy.. where do we start with this can of worms? Guess I’ll jump right in. Warning: This is probably the longest post I’ve written, so if you are easily bored, understand that this is not required reading. It will not be on the test.

Here we go:

At Google we believe that open systems win. They lead to more innovation, value, and freedom of choice for consumers, and a vibrant, profitable, and competitive ecosystem for businesses.

Agreed! I’m fully on board with the spirit of this opening statement!

Many companies will claim roughly the same thing since they know that declaring themselves to be open is both good for their brand and completely without risk.

True. So the question arises: What happens when being open carries with it an amount of risk? Do you open up those areas of your business as well? Or do you forever keep your most valuable layer of the stack closed and proprietary, both in terms of closed source as well as not-fully-open information?

We run the company and make our product decisions based on these principles, so I encourage you to carefully read, review, and debate them. Then own them and try to incorporate them into your work. This is a complex subject and if there is debate (and I’m sure there will be) it should be in the open! Please feel free to comment.

I like the spirit of this discussion so far. I earnestly believe that Google is debating these things internally. But I also take them at their word that they would like this debate to be in the open. Consider this blog post part of my ongoing comment, and ongoing engagement in what I consider to be an extremely important area: The organization and dissemination of information.

There are two components to our definition of open: open technology and open information. Open technology includes open source, meaning we release and actively support code that helps grow the Internet, and open standards, meaning we adhere to accepted standards and, if none exist, work to create standards that improve the entire Internet (and not just benefit Google). Open information means that when we have information about users we use it to provide something that is valuable to them, we are transparent about what information we have about them, and we give them ultimate control over their information.

Ok, first question: Why does open information only include the information that you collect about users? Why does it not include information about what you are trying to do for users, and how, and why? Why does it not include what metrics you are optimizing, so that the user can understand what the search ranking functions are trying to do for him or her? For example, is Google open enough to enable me to know how much diversity is intentionally being injected into my search results? Will they allow me to know to what extent my particular query is being optimized for precision or for recall? Will they allow me to know what factors went into the decision to rank page X higher than page Y, and can I change those factors, so that I am able to instruct the search engine to give me different sets of results for the same exact query term, so as to maximize my value for my own understanding of my own particular information need? Never mind whether or not most users would even want to do something like this. Most probably do not, but many (more than 5%, I have little doubt) do. Is Google open (transparent) enough, in principle, to ever allow a user to make use of the service in this manner? Or is this something that will forever be hidden from the user’s view?

More importantly: Why is the information shown to the user not symmetric? Google stores (and uses) more about the user than the user is allowed to store (and use) about Google. Let me give a concrete example: When I run a query, Google knows a number of pieces of information that arise out of my information-generating actions. It knows:

(1) The text of the query itself, along with the timestamp
(2) All the results that are mutually generated from the intersection of the query (my intellectual product) and the algorithm (Google’s intellectual product)
(3) The link(s) that I clicked as a result of that query
(4) The link(s) that I didn’t click as a result of that query

Last I checked, Google allows me to access and export (1) and (3). But they do not allow me to access and export (2) and (4). It’s not that I didn’t interact with those results, or have a shared hand in their creation. I created the query in the first place. And then for a number of results, I viewed them and explicitly made a choice (judgment) not to visit certain pages. Just because there was no click does not mean that there isn’t any information. This idea is counterintuitive at first, but it’s true. There was an interface, an option about whether or not to click, and a decision not to click. I created that information during my search session. So why can I not export that information? Why can I not get a list of all the results that I viewed, that I decided not to click?

If this information were open, especially through an API, I could start to do all sorts of interesting things with it. I could keep track of whether or not there were certain pages that kept coming up in the results, over and over. And maybe that would allow me to reevaluate my initial non-relevance decision and look deeper into a piece of information that had initially not appeared fruitful (not had much of an information scent associated with it). Being able to keep algorithmic track, through an API, of all the pages I didn’t click would also allow me to compare and contrast relevance and non-relevance information on Google with other search engines such as Yahoo! and Bing. It would allow an ecosystem of service providers to grow up around Google and provide me with software solutions that allowed me to keep tabs on all the search engines simultaneously and understand the relative differences between and relative merits of each. In other words, being able to API-download both my clicks and my non-clicks would allow me to metasearch Google! (And there is a long history of academic literature on the value of metasearch; I don’t need to go into it here.)
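To make the first of those uses concrete, here is a minimal sketch of spotting pages that keep appearing in my results but that I never click. It assumes a hypothetical exported log in which each record holds a query, the results shown, and the results clicked. No such export exists today, and the names `SearchRecord` and `recurring_non_clicks`, along with the sample data, are invented purely for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class SearchRecord:
    """One hypothetical exported search interaction (not a real Google format)."""
    query: str
    shown: list                                # URLs shown, in rank order
    clicked: set = field(default_factory=set)  # URLs the user clicked


def recurring_non_clicks(log, min_count=2):
    """Return (url, count) pairs for URLs shown in at least `min_count`
    searches and never clicked in any of them, most frequent first."""
    shown_counts = Counter()
    ever_clicked = set()
    for record in log:
        shown_counts.update(set(record.shown))  # count each URL once per search
        ever_clicked |= record.clicked
    candidates = [
        (url, n) for url, n in shown_counts.items()
        if n >= min_count and url not in ever_clicked
    ]
    # Sort by count descending, then by URL so the order is deterministic.
    return sorted(candidates, key=lambda pair: (-pair[1], pair[0]))


# Made-up sample log: three searches, some results clicked, some not.
log = [
    SearchRecord("relevance feedback", ["a.example", "b.example", "c.example"], {"a.example"}),
    SearchRecord("rocchio algorithm", ["b.example", "d.example"], {"d.example"}),
    SearchRecord("query expansion", ["b.example", "c.example", "e.example"]),
]
print(recurring_non_clicks(log))
```

With this sample log, `b.example` and `c.example` surface as pages worth a second look: they were shown repeatedly and never clicked.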

I am not talking about some third party company scraping Google’s results and displaying them on their own website for their own purposes. I’m talking about being able to use software (that I’ve licensed from a third party) to do it myself, for myself. For more on this, see Phil Windley’s posts, It’s My Browser and I’ll Auto-Click If I Want To and Claiming My Right to a Purpose-Centric Web and Jon Udell’s post Magic Glasses and Magic Projectors: Private Versus Public Augmentation of Experience (see also the comments/discussion in this latter post). As Windley writes, the issue is this: Do people have the right to control how Web content is displayed in their browser? Openness of search data and information, including all the data associated with that mutually-interactive process (both clicked and non-clicked results) is the cornerstone of transparency and end-user value. In order to claim openness, and to establish end-user value, I feel that there has to be a complete symmetry between everything that Google stores (and makes use of, internally) about the user, and everything that the user is allowed to store (and make use of, internally) about Google.

So is there a symmetry? No. Read the Terms of Service:

The Google Services are made available for your personal, non-commercial use only…{snip}…You may not take the results from a Google search and reformat and display them, or mirror the Google home page or results pages on your Web site. You may not “meta-search” Google…{snip}…You may not send automated queries of any sort to Google’s system without express permission in advance from Google. Note that “sending automated queries” includes, among other things: using any software which sends queries to Google to determine how a website or webpage “ranks” on Google for various queries; “meta-searching” Google; and performing “offline” searches on Google. Please do not write to Google to request permission to “meta-search” Google for a research project, as such requests will not be granted.

Compare that to what was being said above:

Open information means that when we have information about users we use it to provide something that is valuable to them, we are transparent about what information we have about them, and we give them ultimate control over their information.

To me, that does not sound like I have ultimate control over my information…my queries, my clicks, and my non-clicks. If I had ultimate control over those, then I could take those clicked and non-clicked links and use them in any way that I wanted. I could re-search those links, offline, without having to return to Google (value to the user: this response time would be much quicker, and internet connectivity would not be required — privacy could also be enhanced!) I could mash those links up with other search results, from both Google and non-Google sources (metasearch; value to the user: better results, more diverse results, more complete results). As Google says:

Another way to look at the difference between open and closed systems is that open systems allow innovation at all levels — from the operating system to the application layer — not just at the top. This means that one company doesn’t have to depend on another’s benevolence to ship a product.

That is, if I want to install software that lets me metasearch Google, the creator of that software should not have to depend on Google’s benevolence in order to be able to create their product. Openness at the system level is what allows a third-party company to develop the software, and Google-to-user openness on the data level is what lets me, the user, put my data into that software to mash up all of my queries, clicks, and non-clicks, essentially metasearching Google. This is exactly the sort of scenario that openness is meant to address.
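As a rough illustration of what such software might do with exported result lists, the sketch below merges rankings from several engines using reciprocal rank fusion, one standard technique from the metasearch literature. I have not named a fusion method anywhere above, so the choice of technique, the `rrf_merge` function, the constant `k=60` (a commonly used default), and the sample result lists are all assumptions made for the sake of the example.

```python
from collections import defaultdict


def rrf_merge(rankings, k=60):
    """Merge ranked URL lists with reciprocal rank fusion.

    Each URL scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. A higher total score ranks first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, url in enumerate(ranking, start=1):
            scores[url] += 1.0 / (k + rank)
    # Sort by score descending, then by URL so ties are deterministic.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical exported result lists for the same query from three engines.
engine_a = ["x.example", "y.example", "z.example"]
engine_b = ["y.example", "x.example", "w.example"]
engine_c = ["y.example", "z.example", "x.example"]

merged = rrf_merge([engine_a, engine_b, engine_c])
print(merged)
```

A page that ranks well across several engines rises to the top of the merged list, while a page that only one engine returns falls toward the bottom. None of this is possible without permission to take the result lists out of each engine in the first place.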

Continuing on:

If we can embody a consistent commitment to open — which I believe we can — then we have a big opportunity to lead by example and encourage other companies and industries to adopt the same commitment. If they do, the world will be a better place.

And:

If they use our products and store content with us, it’s their content, not ours. They should be able to export it or delete it at any time, at no cost, and as easily as possible.

Again, I agree. When all search engines, Google included, allow me to store (and reuse!) my search interaction data (all queries, clicks, and non-clicks) then this data truly becomes valuable to me, the user. Yahoo! has taken much more of a lead in this area with SearchMonkey. Not only do I not see the same openness from Google, I see Terms of Service that actively discriminate against these sorts of applications and usage of one’s own data.

Open systems are just the opposite. They are competitive and far more dynamic. In an open system, a competitive advantage doesn’t derive from locking in customers, but rather from understanding the fast-moving system better than anyone else and using that knowledge to generate better, more innovative products. The successful company in an open system is both a fast innovator and a thought leader; the brand value of thought leadership attracts customers and then fast innovation keeps them. This isn’t easy — far from it — but fast companies have nothing to fear, and when they are successful they can generate great shareholder value. Open systems have the potential to spawn industries. They harness the intellect of the general population and spur businesses to compete, innovate, and win based on the merits of their products and not just the brilliance of their business tactics.

Make no mistake: I greatly admire the ideals being expressed here. I cannot stop agreeing. I just have a hard time reconciling this with what is written later on in the same post:

While we are committed to opening the code for our developer tools, not all Google products are open source. Our goal is to keep the Internet open, which promotes choice and competition and keeps users and developers from getting locked in. In many cases, most notably our search and ads products, opening up the code would not contribute to these goals and would actually hurt users. The search and advertising markets are already highly competitive with very low switching costs, so users and advertisers already have plenty of choice and are not locked in.

By only making part of one’s information available (queries and clicks) and not the other part (seen but non-clicked results that were created by the user+Google collaborative query+algorithm search session, as well as unseen and non-clicked results from that same session), it makes it much more difficult for a user to migrate to another service. For example, what if Bing were to start offering full-time, default personalization the same way Google now does? Google undoubtedly trains one’s personalized algorithm using all of one’s own information: queries, clicks and non-clicks. Abandoned searches (and knowledge of exactly which results were not clicked) are just as much an important part of the overall algorithmic mixture, are they not? They convey important information. So suppose Bing now wanted to allow you, the user, to jump headfirst into personalized Bing results. The best way to do that would be to upload your Google search data to Bing, so that Bing could start personalizing based on this same years-long history. Google does not allow this. That information is not available for you to use, in any third-party application…whether metasearch or Bing-provided personalized search. Saying that a user is only one click away from another search engine masks the full truth. If the first search engine has learned something about a user and can therefore provide different, personalized results for that user based on months or years of history, then that search engine is stickier than others. In order to be truly open, a user has to be able to “export” his or her profile, both the clicks and the non-clicks, and take that information to another search engine, or else there will be lock-in.

The Google post continues:

Not to mention the fact that opening up these systems would allow people to “game” our algorithms to manipulate search and ads quality rankings, reducing our quality for everyone.

Ok, here is where I have to take another strong, contrary stance. Google just got finished saying the following, earlier in the post:

Open systems are just the opposite. They are competitive and far more dynamic. In an open system, a competitive advantage doesn’t derive from locking in customers, but rather from understanding the fast-moving system better than anyone else and using that knowledge to generate better, more innovative products.

So let me get this straight: Google is pro-openness in all the areas where they don’t make any money, where openness doesn’t actually affect the bottom line. But when it comes to the moneymaker, that has to be closed and proprietary to protect it from spammers? But I thought that open systems are just the opposite. Don’t competition and dynamism create an environment in which everyone can work on the problem of fighting spam, thereby coming up with a better solution and a quicker solution than a closed environment can produce? Isn’t one of the maxims of openness the idea that the smartness of the competitive environment as a whole can far outproduce any one company? Isn’t that the whole idea behind the Netflix prize (for example), that as everyone shares their solutions with each other, everyone gets better than any one team could have done on its own?

Opening up Google’s algorithms might allow people to “game” them for a short time. But in the long run, the benefits that come from openness would outpace the spammers’ ability to game. Right? Isn’t that the technological optimism that this blog post is expressing?

Our skills and our culture give us the opportunity and responsibility to prevent this from happening. We believe in the power of technology to deliver information. We believe in the power of information to do good. We believe that open is the only way for this to have the broadest impact for the most people. We are technology optimists who trust that the chaos of open benefits everyone. We will fight to promote it every chance we get. Open will win. It will win on the Internet and will then cascade across many walks of life: The future of government is transparency. The future of commerce is information symmetry. The future of culture is freedom. The future of science and medicine is collaboration. The future of entertainment is participation. Each of these futures depends on an open Internet.

Open will win. Open will beat the spammers, will it not? Hasn’t Google just expressed confidence that it will? Simultaneously, open will better serve the users. And open will allow third-parties to create exploratory- and recall-oriented and social and collaborative search systems that make use of Google algorithms and indices as just another layer in the overall information organization and dissemination stack. This grows the overall pie. Remember:

Another way to look at the difference between open and closed systems is that open systems allow innovation at all levels — from the operating system to the application layer — not just at the top. This means that one company doesn’t have to depend on another’s benevolence to ship a product.

Search is another layer in that stack, one that should be just as open as the other layers, and one that need not fear gaming, because openness will overcome. That’s the call-to-arms that I see expressed in this blog post. And it gets me excited. At the same time, I see an elephant in the room. And that is Google’s unwillingness to be open in the one (and mainly only) area where money is made. And what really bothers me about that is that it seems no different from any other technology company: All technology companies want openness in those layers of the stack where they don’t make money, and closedness where they do. Google does not seem any different.

Even Google’s excuse for not being open is very similar to others, like Microsoft’s. Microsoft, which makes the bulk of its money on Windows, says that Windows can’t be open-sourced because hackers will see all the bugs in the code and be able to more easily exploit the OS and write viruses. Companies like Google say “nonsense”, and point to open source OSes like Linux as an example of how openness can breed hardened code that is less hackable. And artists and record labels and newspapers also worry about bootleggers and content-copiers depriving them of their stack-layer income, despite absolutist assurances from Google that increased traffic guarantees monetization. So why does Google think that if we open-sourced search algorithms, the community would not be able to “harden” those algorithms against spammers, and simultaneously guarantee Google’s income? There may be a little chaos at first, but the chaos of open benefits everyone.

Toward the end, Google wraps up:

All of this is useless, however, if we fail when it comes to being open. So we need to constantly push ourselves. Are we contributing to open standards that better the industry? What’s stopping us from open sourcing our code? Are we giving our users value, transparency, and control? Open up as much as you can as often as you can, and if anyone questions whether this is a good approach, explain to them why it’s not just a good approach, but the best approach. It is an approach that will transform business and commerce in this still young century, and when we are successful we will effectively re-write the MBA curriculum for the next several decades!

Make no mistake, I am on board with the stated goals. However, in order to rewrite that MBA curriculum, I need to see Google be just as open at their moneymaking core stack layer (search and ads) as they want everyone else to be with operating systems, networks, and intellectual property (books, music, news articles, etc.). It is just a little too convenient to have an excuse for why one’s own layer cannot be opened up (spammers) while insisting that everyone else’s layers can, hackers and bootleggers be damned.

I feel very strongly about all this, but that does not mean that I am correct. Rather, I am taking Google seriously at its word: “I encourage you to carefully read, review, and debate [these principles]. Then own them and try to incorporate them into your work. This is a complex subject and if there is debate (and I’m sure there will be) it should be in the open! Please feel free to comment.” Consider this blog post my first of many comments. Not for the purpose of tearing down, or argument for argument’s sake. But for the purpose of getting these challenges and questions and comments out in the open, exactly as requested, in order to further the same end goal. It would be fantastic if Google succeeded, and I agree with them: It’s not just a good approach, but the best approach. But in order to start transforming business and commerce, Google needs to set the example by being open in its core area (search and ads) and trusting that the chaos of openness will defeat spammers in addition to hackers and bootleggers. Right now, for all the openness in non-core business areas, and for all the talk, Google is unwilling to be open where it really matters.

Update: TechCrunch makes almost exactly the same points, without going into as much detail as I have above. Still, I think the details that I mention add an important layer to the overall discussion. First, I have pointed out how Google can be open about its end search results/data without having to open its algorithms (via allowing metasearch and via port-to-Bing personalized search). The problem is that doing so would disintermediate Google, and push them down from the top of the stack. Why? Because now users could build all sorts of applications on top of Google search results, instead of going through the ad-filled Google interface. So openness is a problem when it comes to making money. I also think there is value in pointing out that, were the algorithms themselves to be opened up, it’s not like search is the only industry that has to contend with miscreants. OS developers have to deal with hackers, content creators (musicians, authors) have to deal with bootleggers. So if you ask the OS to go open-source, and the musician to go DRM-free, what’s so different about asking the search engine to go open-algorithm?

Update 2: Also check out Chris Dixon’s post (http://cdixon.org/2009/12/22/google-should-open-source-what-actually-matters-their-search-ranking-algorithm/) and Danny Sullivan’s comments (http://cdixon.org/2009/12/22/google-should-open-source-what-actually-matters-their-search-ranking-algorithm/#comment-27024421), both of which are quite similar in spirit to where I am coming from on this matter. Also interesting is Harvard Business School Prof. Tom Eisenmann’s take (http://platformsandnetworks.blogspot.com/2009/12/googles-svp-product-management-jonathan.html). If discussion is what Google wants, discussion is what Google gets.

The Craft of Storytelling

jeremy — Thu, 05 Nov 2009 11:50:39 +0000

I’ve been playing around with some old TREC data over the past few days and completely by chance I came across this document. I find it interesting because storytelling is a good metaphor for what we as researchers do when we construct interactive information seeking systems. The document is short enough that I think I can reproduce it here in its entirety without getting into intellectual property trouble. I hope.

DOCNO: LA070590-0123

DOCID: 243123

July 5, 1990, Thursday, Home Edition

Calendar; Part F; Page 1; Column 1; Calendar Desk

57 words

QUOTABLE

“Networks are run by people whose weakest suit is that they can’t understand the importance of the craft of storytelling, which is what film and television are all about. . . . They can do statistical things, but they can’t quantify storytelling and put it into a computer.”

Writer-producer Roy Huggins, in Television & Families magazine

Wikipedia’s take on Roy Huggins.

Tomorrow’s Data

jeremy — Mon, 02 Nov 2009 12:38:41 +0000

Jeff Dalton recently wrote about why he doesn’t want your search log data. It is an interesting read, and I recommend going through the whole article and comments. But I want to call attention to one thought in particular:

Academia should be building solutions for tomorrow’s data, not yesterday’s. What will the queries and documents look like in 5 or even 10 years and how can we improve retrieval for those? It’s not an easy question to answer, but you can watch Bruce Croft’s CIKM keynote for some ideas…I still believe in empirical research. However, I’m also well-aware that over-reliance on limited data can lead to overfitting and incremental changes instead of ground-breaking research. To use an analogy from Wall Street, we become too focused on quarterly paper deadlines and lose sight of the fundamental science.

It is a provocative thought, and I find it compelling. By spending too much effort paying attention to yesterday’s — and even today’s — data, you wind up limiting yourself to the existing, visible gradient. At the same time, an open question is how one develops for tomorrow’s data when that data by definition does not yet exist. This is a question that I hope to address more in the upcoming months. Not answer, but address. Most likely by pointing to work by other researchers not directly working on the IR task (as I’ve done a bit in the past). Developing for tomorrow’s data is not an easy task, but neither should it be dismissed just because it is too far beyond the needs of today’s users.

See also: AT&T Labs vs. Google Labs: Not your grandfather’s R&D (Ars Technica, 2006)

There’s no doubt that the information economy continues to create a lot of wealth, but I think it’s fair to ask if it’s also creating enough science to replenish the stock of scientific capital that it’s still burning through. I think it’s clear that chaotic, market-driven change is a good way to bring ideas quickly and efficiently from concept to profitable product. However, such a rapid churning of the institutional and cultural landscape ultimately may not be conducive to the kind of steady, expensive, long-term investment in fundamental research that produces the really big ideas that somewhere, at some completely unforeseeable point in the future, change the world.

Also: “I, Cringely” from October 2002, entitled Eating our Seed Corn

Doing to Music What They Did to the Web

jeremy — Thu, 29 Oct 2009 00:37:44 +0000

I’ve added a couple of updates to my previous post about the “Google Discover Music” service that is launching today. See also Paul’s writeup.

But I have been reading Danny Sullivan’s liveblog of the release event, and came across a quote that made me chuckle out loud:

Bill talking about how this will let people hear more diverse music. “They’re [Google Music is] going to do for music what they did for the web.”

Oh my goodness, I hope not! Because what they did for the web is put a popularity filter in front of their content-based search mechanism:

Google search works because it relies on the millions of individuals posting links on websites to help determine which other sites offer content of value. We assess the importance of every web page using more than 200 signals and a variety of techniques, including our patented PageRank algorithm, which analyzes which sites have been “voted” to be the best sources of information by other pages across the web. As the web gets bigger, this approach actually improves, as each new site is another point of information and another vote to be counted. In the same vein, we are active in open source software development, where innovation takes place through the collective effort of many programmers.

I do not want my music retrieval and discovery algorithms to be powered by the millions of individuals posting (and clicking) links in order to help determine which musicians and songs offer content of value. I do not want my music search results to have been “voted” their way into my results list. I do not want such a music search service to get even bigger by counting even more points of information and votes.

If Google ends up doing to music what they did to the web, they will destroy music. Please let it not be so. As Brian Whitman, founder of The Echo Nest, recently said at a conference:

“If we only used collaborative filtering to discover music, the popular artists would eat the unknowns alive.”

Yup.
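
To make the worry concrete, here is a minimal sketch of link-vote ranking, written as a plain power-iteration PageRank. It is only an illustration under stated assumptions: the toy graph, the node names ("fan1", "star", "unknown") and the damping value of 0.85 are all hypothetical, and this is not Google's production ranking, which by its own description combines more than 200 signals.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    links maps each node to the list of nodes it links to ("votes" for).
    Returns a dict mapping node -> score; the scores sum to 1.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the small "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this node's damped score evenly among its outlinks.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A node with no outlinks spreads its score over all nodes.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy data: four fans all link to a star artist,
# and only one of them also links to an unknown artist.
links = {
    "fan1": ["star"],
    "fan2": ["star"],
    "fan3": ["star"],
    "fan4": ["star", "unknown"],
    "star": [],
    "unknown": [],
}
scores = pagerank(links)
for name in sorted(scores, key=scores.get, reverse=True):
    print(f"{name}: {scores[name]:.3f}")
```

In this toy graph the star outranks the unknown artist simply because more nodes point at it; nothing about the music itself enters the score. That is the popularity filter in miniature.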

UPDATE: I just noticed something in this new Google Music service that I hadn’t noticed before: Popups! Check out this explanation video from the official Google blog, starting at 0:34 and going to 0:47. Compare and contrast that with the official Google position on popups on the Google site:

We do not allow pop-up ads of any kind on our site. We find them annoying.

But there is a solution! Google recommends the following:

If you are experiencing pop-ups generated by one of these malicious programs, you may want to remove the pop-up program from your computer.

Hmmm….

Data Liberation and Ownership

jeremy — Fri, 18 Sep 2009 15:31:36 +0000

I split my blogging between this and the FXPAL blog. This morning I have a post on the latter site that asks an (imho) important question about data ownership and data liberation with respect to one’s web search history: not just the queries, but the results produced by a mashup between those queries and the back-end algorithms. Here is the key point:

Here is an analogy by way of Adobe Photoshop. Suppose you open one of your images in the online (webapp) version of Photoshop, apply the Gaussian Blur (soft focus) filter to the image, and then save that result out again. It’s clear that you own the input (it’s your photo), that Adobe owns the Gaussian Blur algorithm (or at least the implementation of it), and that you own the resulting image. Adobe doesn’t lay ownership claim to the output of the algorithm, even though it was their algorithm that produced the output.

So how is this different from a web search? You own the input (the query string that you type). Google owns the algorithm that transforms that input into a list of results. So wouldn’t you also then own the output of that transformation? Not the algorithm, but the output of the algorithm, i.e. the results set. Just like you own the output in Photoshop.

It will be interesting to see whether or not Google will be open enough to allow you to extract this particular form of your data. Currently, they do not.

I would invite you to visit the FXPAL blog, and read the post in full. And comment/disagree, where necessary.