Google and the Meaning of Open | Information Retrieval Gupf

There is a fantastic Google blog post today by Jonathan Rosenberg on the meaning (and value) of openness. Whooo-boy.. where do we start with this can of worms? Guess I’ll jump right in. Warning: This is probably the longest post I’ve written, so if you are easily bored, understand that this is not required reading. It will not be on the test.

Here we go:

At Google we believe that open systems win. They lead to more innovation, value, and freedom of choice for consumers, and a vibrant, profitable, and competitive ecosystem for businesses.

Agreed! I’m fully on board with the spirit of this opening statement!

Many companies will claim roughly the same thing since they know that declaring themselves to be open is both good for their brand and completely without risk.

True. So the question arises: What happens when being open carries with it an amount of risk? Do you open up those areas of your business as well? Or do you forever keep your most valuable layer of the stack closed and proprietary, both in terms of closed source as well as not-fully-open information?

We run the company and make our product decisions based on these principles, so I encourage you to carefully read, review, and debate them. Then own them and try to incorporate them into your work. This is a complex subject and if there is debate (and I’m sure there will be) it should be in the open! Please feel free to comment.

I like the spirit of this discussion so far. I earnestly believe that Google is debating these things internally. But I also take them at their word that they would like this debate to be in the open. Consider this blog post part of my ongoing comment, and ongoing engagement in what I consider to be an extremely important area: The organization and dissemination of information.

There are two components to our definition of open: open technology and open information. Open technology includes open source, meaning we release and actively support code that helps grow the Internet, and open standards, meaning we adhere to accepted standards and, if none exist, work to create standards that improve the entire Internet (and not just benefit Google). Open information means that when we have information about users we use it to provide something that is valuable to them, we are transparent about what information we have about them, and we give them ultimate control over their information.

Ok, first question: Why does open information only include the information that you collect about users? Why does it not include information about what you are trying to do for users, and how, and why? Why does it not include what metrics you are optimizing, so that the user can understand what the search ranking functions are trying to do for him or her? For example, is Google open enough to enable me to know how much diversity is intentionally being injected into my search results? Will they allow me to know to what extent my particular query is being optimized for precision or for recall? Will they allow me to know what factors went into the decision to rank page X higher than page Y, and can I change those factors, so that I am able to instruct the search engine to give me different sets of results for the same exact query term, so as to maximize my value for my own understanding of my own particular information need? Never mind whether or not most users would even want to do something like this. Most probably do not, but many (more than 5%, I have little doubt) do. Is Google open (transparent) enough, in principle, to ever allow a user to make use of the service in this manner? Or is this something that will forever be hidden from the user’s view?

More importantly: Why is the information shown to the user not symmetric? Google stores (and uses) more about the user than the user is allowed to store (and use) about Google. Let me give a concrete example: When I run a query, Google knows a number of pieces of information that arise out of my information-generating actions. It knows:

1. The text of the query itself, along with the timestamp
2. All the results that are mutually generated from the intersection of the query (my intellectual product) and the algorithm (Google’s intellectual product)
3. The link(s) that I clicked as a result of that query
4. The link(s) that I didn’t click as a result of that query

Last I checked, Google allows me to access and export (1) and (3). But they do not allow me to access and export (2) and (4). It’s not that I didn’t interact with those results, or have a shared hand in their creation. I created the query in the first place. And then for a number of results, I viewed them and explicitly made a choice (judgment) not to visit certain pages. Just because there was no click does not mean that there isn’t any information. This idea is counterintuitive at first, but it’s true. There was an interface, an option about whether or not to click, and a decision not to click. I created that information during my search session. So why can I not export that information? Why can I not get a list of all the results that I viewed, that I decided not to click?

If this information were open, especially through an API, I could start to do all sorts of interesting things with it. I could keep track of whether or not there were certain pages that kept coming up in the results, over and over. And maybe that would allow me to reevaluate my initial non-relevance decision and look deeper into a piece of information that had initially not appeared fruitful (that is, had not carried much of an information scent). Being able to keep algorithmic track, through an API, of all the pages I didn’t click would also allow me to compare and contrast relevance and non-relevance information on Google with other search engines such as Yahoo! and Bing. It would allow an ecosystem of service providers to grow up around Google and provide me with software solutions that allowed me to keep tabs on all the search engines simultaneously and understand the relative differences between, and relative merits of, each. In other words, being able to API-download both my clicks and my non-clicks would allow me to metasearch Google! (And there is a long history of academic literature on the value of metasearch; I don’t need to go into it here.)
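To make the "pages that kept coming up" idea concrete, here is a minimal sketch in Python. It assumes a purely hypothetical export format: one record per search session, listing the results shown and the results clicked. No search engine is known to offer such an export, and the field names (`query`, `shown`, `clicked`) and the function name are invented for illustration.

```python
from collections import Counter

# Hypothetical export: one record per search session. The structure and
# field names are assumptions for illustration, not a real API response.
sessions = [
    {"query": "metasearch fusion",
     "shown": ["a.example/1", "b.example/2", "c.example/3"],
     "clicked": ["a.example/1"]},
    {"query": "rank fusion methods",
     "shown": ["b.example/2", "d.example/4"],
     "clicked": ["d.example/4"]},
    {"query": "combining search engines",
     "shown": ["b.example/2", "c.example/3", "e.example/5"],
     "clicked": []},
]


def recurring_non_clicks(sessions, min_sessions=2):
    """Return (url, count) pairs for URLs that were shown but not clicked
    in at least `min_sessions` sessions, most frequent first."""
    counts = Counter()
    for session in sessions:
        clicked = set(session["clicked"])
        # Use a set so a URL shown twice in one session is counted once.
        for url in set(session["shown"]) - clicked:
            counts[url] += 1
    recurring = [(url, n) for url, n in counts.items() if n >= min_sessions]
    # Sort by descending count, then by URL for a stable order.
    return sorted(recurring, key=lambda pair: (-pair[1], pair[0]))


for url, n in recurring_non_clicks(sessions):
    print(f"{url}: passed over in {n} sessions")
```

A page that I have passed over in session after session is exactly the kind of candidate I might want to reevaluate, and this is only possible if the non-click data can be exported in the first place.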

I am not talking about some third party company scraping Google’s results and displaying them on their own website for their own purposes. I’m talking about being able to use software (that I’ve licensed from a third party) to do it myself, for myself. For more on this, see Phil Windley’s posts, “It’s My Browser and I’ll Auto-Click If I Want To” and “Claiming My Right to a Purpose-Centric Web”, and Jon Udell’s post “Magic Glasses and Magic Projectors: Private Versus Public Augmentation of Experience” (see also the comments/discussion in this latter post). As Windley writes, the issue is this: Do people have the right to control how Web content is displayed in their browser? Openness of search data and information, including all the data associated with that mutually-interactive process (both clicked and non-clicked results), is the cornerstone of transparency and end-user value. In order to claim openness, and to establish end-user value, I feel that there has to be a complete symmetry between everything that Google stores (and makes use of, internally) about the user, and everything that the user is allowed to store (and make use of, internally) about Google.

So is there a symmetry? No. Read the Terms of Service:

The Google Services are made available for your personal, non-commercial use only…{snip}…You may not take the results from a Google search and reformat and display them, or mirror the Google home page or results pages on your Web site. You may not “meta-search” Google…{snip}…You may not send automated queries of any sort to Google’s system without express permission in advance from Google. Note that “sending automated queries” includes, among other things: using any software which sends queries to Google to determine how a website or webpage “ranks” on Google for various queries; “meta-searching” Google; and performing “offline” searches on Google. Please do not write to Google to request permission to “meta-search” Google for a research project, as such requests will not be granted.

Compare that to what was being said above:

Open information means that when we have information about users we use it to provide something that is valuable to them, we are transparent about what information we have about them, and we give them ultimate control over their information.

To me, that does not sound like I have ultimate control over my information…my queries, my clicks, and my non-clicks. If I had ultimate control over those, then I could take those clicked and non-clicked links and use them in any way that I wanted. I could re-search those links, offline, without having to return to Google (value to the user: this response time would be much quicker, and internet connectivity would not be required — privacy could also be enhanced!) I could mash those links up with other search results, from both Google and non-Google sources (metasearch; value to the user: better results, more diverse results, more complete results). As Google says:

Another way to look at the difference between open and closed systems is that open systems allow innovation at all levels — from the operating system to the application layer — not just at the top. This means that one company doesn’t have to depend on another’s benevolence to ship a product.

That is, if I want to install software that lets me metasearch Google, the creator of that software should not have to depend on Google’s benevolence in order to be able to create their product. Openness at the system level is what allows a third-party company to develop the software, and Google-to-user openness on the data level is what lets me, the user, put my data into that software to mash up all of my queries, clicks, and non-clicks, essentially metasearching Google. This is exactly the sort of scenario that openness is meant to address.
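As a rough illustration of what such user-side mash-up software might do, the sketch below merges ranked result lists from two engines using reciprocal rank fusion, a technique from the metasearch literature that I am picking here purely as an example. The result lists, the engine names, and the set of previously rejected URLs are all made up, and nothing here reflects how any real search engine or product works.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of URLs into one ranking.

    Each URL scores 1 / (k + rank) in every list where it appears
    (rank starts at 1), and the scores are summed across lists.
    """
    scores = defaultdict(float)
    for results in ranked_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Made-up result lists for the same query from two hypothetical engines.
engine_a = ["x.example", "y.example", "z.example"]
engine_b = ["y.example", "w.example", "x.example"]

fused = reciprocal_rank_fusion([engine_a, engine_b])

# Made-up set of URLs I saw and chose not to click in earlier sessions.
previously_rejected = {"w.example"}
fresh = [url for url in fused if url not in previously_rejected]

print(fused)
print(fresh)
```

The second step is the part that depends on non-click data: without an export of what I have already seen and passed over, the software has no way to know which results are fresh to me.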

Continuing on:

If we can embody a consistent commitment to open — which I believe we can — then we have a big opportunity to lead by example and encourage other companies and industries to adopt the same commitment. If they do, the world will be a better place.

And:

If they use our products and store content with us, it’s their content, not ours. They should be able to export it or delete it at any time, at no cost, and as easily as possible.

Again, I agree. When all search engines, Google included, allow me to store (and reuse!) my search interaction data (all queries, clicks, and non-clicks) then this data truly becomes valuable to me, the user. Yahoo! has taken much more of a lead in this area with SearchMonkey. Not only do I not see the same openness from Google, I see Terms of Service that actively discriminate against these sorts of applications and usage of one’s own data.

Open systems are just the opposite. They are competitive and far more dynamic. In an open system, a competitive advantage doesn’t derive from locking in customers, but rather from understanding the fast-moving system better than anyone else and using that knowledge to generate better, more innovative products. The successful company in an open system is both a fast innovator and a thought leader; the brand value of thought leadership attracts customers and then fast innovation keeps them. This isn’t easy — far from it — but fast companies have nothing to fear, and when they are successful they can generate great shareholder value. Open systems have the potential to spawn industries. They harness the intellect of the general population and spur businesses to compete, innovate, and win based on the merits of their products and not just the brilliance of their business tactics.

Make no mistake: I greatly admire the ideals being expressed here. I cannot stop agreeing. I just have a hard time reconciling this with what is written later on in the same post:

While we are committed to opening the code for our developer tools, not all Google products are open source. Our goal is to keep the Internet open, which promotes choice and competition and keeps users and developers from getting locked in. In many cases, most notably our search and ads products, opening up the code would not contribute to these goals and would actually hurt users. The search and advertising markets are already highly competitive with very low switching costs, so users and advertisers already have plenty of choice and are not locked in.

By only making part of one’s information available (queries and clicks) and not the other part (seen but non-clicked results that were created by the user+Google collaborative query+algorithm search session, as well as unseen and non-clicked results from that same session), it makes it much more difficult for a user to migrate to another service. For example, what if Bing were to start offering full-time, default personalization the same way Google now does? Google undoubtedly trains one’s personalized algorithm using all of one’s own information: queries, clicks and non-clicks. Abandoned searches (and knowledge of exactly which results were not clicked) are just as much an important part of the overall algorithmic mixture, are they not? They convey important information. So suppose Bing now wanted to allow you, the user, to jump headfirst into personalized Bing results. The best way to do that would be to upload your Google search data to Bing, so that Bing could start personalizing based on this same years-long history. Google does not allow this. That information is not available for you to use, in any third-party application…whether metasearch or Bing-provided personalized search. Saying that a user is only one click away from another search engine masks the full truth. If the first search engine has learned something about a user and can therefore provide different, personalized results for that user based on months or years of history, then that search engine is stickier than others. In order to be truly open, a user has to be able to “export” his or her profile, both the clicks and the non-clicks, and take that information to another search engine, or else there will be lock-in.

The Google post continues:

Not to mention the fact that opening up these systems would allow people to “game” our algorithms to manipulate search and ads quality rankings, reducing our quality for everyone.

Ok, here is where I have to take another strong, contrary stance. Google just got finished saying the following, earlier in the post:

Open systems are just the opposite. They are competitive and far more dynamic. In an open system, a competitive advantage doesn’t derive from locking in customers, but rather from understanding the fast-moving system better than anyone else and using that knowledge to generate better, more innovative products.

So let me get this straight: Google is pro-openness in all the areas where they don’t make any money, where openness doesn’t actually affect the bottom line. But when it comes to the moneymaker, that has to be closed and proprietary to protect it from spammers? But I thought that open systems are just the opposite. Don’t competition and dynamism create an environment in which everyone can work on the problem of fighting spam, thereby coming up with a better solution and a quicker solution than a closed environment can produce? Isn’t one of the maxims of openness the idea that the smartness of the competitive environment as a whole can far outproduce any one company? Isn’t that the whole idea behind the Netflix prize (for example): that as everyone shares their solutions with each other, everyone gets better than any one team could have done on its own?

Opening up Google’s algorithms might allow people to “game” them for a short time. But in the long run, the benefits that come from openness would outpace the spammers’ ability to game. Right? Isn’t that the technological optimism that this blog post is expressing?

Our skills and our culture give us the opportunity and responsibility to prevent this from happening. We believe in the power of technology to deliver information. We believe in the power of information to do good. We believe that open is the only way for this to have the broadest impact for the most people. We are technology optimists who trust that the chaos of open benefits everyone. We will fight to promote it every chance we get. Open will win. It will win on the Internet and will then cascade across many walks of life: The future of government is transparency. The future of commerce is information symmetry. The future of culture is freedom. The future of science and medicine is collaboration. The future of entertainment is participation. Each of these futures depends on an open Internet.

Open will win. Open will beat the spammers, will it not? Hasn’t Google just expressed confidence that it will? Simultaneously, open will better serve the users. And open will allow third-parties to create exploratory- and recall-oriented and social and collaborative search systems that make use of Google algorithms and indices as just another layer in the overall information organization and dissemination stack. This grows the overall pie. Remember:

Another way to look at the difference between open and closed systems is that open systems allow innovation at all levels — from the operating system to the application layer — not just at the top. This means that one company doesn’t have to depend on another’s benevolence to ship a product.

Search is another layer in that stack, one that should be just as open as the other layers, and one that need not fear gaming, because openness will overcome. That’s the call-to-arms that I see expressed in this blog post. And it gets me excited. At the same time, I see an elephant in the room. And that is Google’s unwillingness to be open in the one (and mainly only) area where money is made. And what really bothers me about that is that it seems no different from any other technology company: All technology companies want openness in those layers of the stack where they don’t make money, and closedness where they do. Google does not seem any different.

Even Google’s excuse for not being open is very similar to others, like Microsoft’s. Microsoft, which makes the bulk of its money on Windows, says that Windows can’t be open-sourced because hackers will see all the bugs in the code and be able to more easily exploit the OS and write viruses. Companies like Google say “nonsense”, and point to open source OSes like Linux as an example of how openness can breed hardened code that is less hackable. And artists and record labels and newspapers also worry about bootleggers and content-copiers depriving them of their stack-layer income, despite absolutist assurances from Google that increased traffic guarantees monetization. So why does Google think that if we open-sourced search algorithms, the community would not be able to “harden” those algorithms against spammers, and simultaneously guarantee Google’s income? There may be a little chaos at first, but the chaos of open benefits everyone.

Toward the end, Google wraps up:

All of this is useless, however, if we fail when it comes to being open. So we need to constantly push ourselves. Are we contributing to open standards that better the industry? What’s stopping us from open sourcing our code? Are we giving our users value, transparency, and control? Open up as much as you can as often as you can, and if anyone questions whether this is a good approach, explain to them why it’s not just a good approach, but the best approach. It is an approach that will transform business and commerce in this still young century, and when we are successful we will effectively re-write the MBA curriculum for the next several decades!

Make no mistake, I am on board with the stated goals. However, in order to rewrite that MBA curriculum, I need to see Google be just as open at their moneymaking core stack layer (search and ads) as they want everyone else to be with operating systems, networks, and intellectual property (books, music, news articles, etc.) It is just a little too convenient to have an excuse about why one’s own layer cannot be opened up (spammers), but that everyone else’s layers can, hackers and bootleggers be damned.

I feel very strongly about all this, but that does not mean that I am correct. Rather, I am taking Google seriously at its word: “I encourage you to carefully read, review, and debate [these principles]. Then own them and try to incorporate them into your work. This is a complex subject and if there is debate (and I’m sure there will be) it should be in the open! Please feel free to comment.” Consider this blog post my first of many comments. Not for the purpose of tearing down, or argument for argument’s sake. But for the purpose of getting these challenges and questions and comments out in the open, exactly as requested, in order to further the same end goal. It would be fantastic if Google succeeded, and I agree with them: It’s not just a good approach, but the best approach. But in order to start transforming business and commerce, Google needs to set the example by being open in its core area (search and ads) and trusting that the chaos of openness will defeat spammers in addition to hackers and bootleggers. Right now, for all the openness in non-core business areas, and for all the talk, Google is unwilling to be open where it really matters.

Update: TechCrunch makes almost exactly the same points, without going into as much detail as I have above. Still, I think the details that I mention add an important layer to the overall discussion. First, I have pointed out how Google can be open about its end search results/data without having to open its algorithms (via allowing metasearch and via port-to-Bing personalized search). The problem is that doing so would disintermediate Google, and push them down from the top of the stack. Why? Because now users could build all sorts of applications on top of Google search results, instead of going through the ad-filled Google interface. So openness is a problem when it comes to making money. I also think there is value in pointing out that, were the algorithms themselves to be opened up, it’s not like search is the only industry that has to contend with miscreants. OS developers have to deal with hackers, content creators (musicians, authors) have to deal with bootleggers. So if you ask the OS to go open-source, and the musician to go DRM-free, what’s so different about asking the search engine to go open-algorithm?

Update 2: Also check out Chris Dixon’s post (http://cdixon.org/2009/12/22/google-should-open-source-what-actually-matters-their-search-ranking-algorithm/) and Danny Sullivan’s comments (http://cdixon.org/2009/12/22/google-should-open-source-what-actually-matters-their-search-ranking-algorithm/#comment-27024421), both of which are quite similar in spirit to where I am coming from on this matter. Also interesting is Harvard Business School Prof. Tom Eisenmann’s take (http://platformsandnetworks.blogspot.com/2009/12/googles-svp-product-management-jonathan.html). If discussion is what Google wants, discussion is what Google gets 😉

5 Responses to Google and the Meaning of Open

Raza says:

December 22, 2009 at 3:18 pm

Very informative and interesting post Jeremy. I actually read Techcrunch’s post on this issue before I read yours. But your post is an in-depth analysis of what Erick was saying. In particular, comparing Google’s reluctance to open source its search and advertising algorithms to Microsoft’s strategy was spot-on. I personally believe that Jonathan Rosenberg’s blog post was perfectly crafted so that it highlights how open Google is and at the same time debunk Apple.
jeremy says:

December 22, 2009 at 4:31 pm

Thanks, Raza. While I feel strongly about the things I do, I hope that the spirit of this post has come across as: “Ok, here are a few points of contention. Let’s discuss them and figure out what they mean within the context of the overall larger goal, openness.”
jeremy says:

December 22, 2009 at 4:39 pm

..and if you haven’t, I would also strongly encourage you to read those two Phil Windley and one Jon Udell posts, above. Lots of interesting discussion about what it really means to give the user control:

– It’s My Browser and I’ll Auto-Click If I Want To
– Claiming My Right to a Purpose-Centric Web
– Magic Glasses and Magic Projectors: Private Versus Public Augmentation of Experience

It’s a good exercise to think about how those ideas apply to search and ads.
Pingback: Information Retrieval Gupf » A Fragile Local Maximum for the Web
Pingback: Information Retrieval Gupf » What You Can Find Out
