Social Implications – Information Retrieval Gupf

A Button Without The Treat

jeremy — Mon, 13 Jun 2011 12:54:01 +0000

A few months ago I wrote a post entitled +1 is Explicit, but is not Relevance Feedback. I am often personally concerned that, with many of the posts I write, I am being pedantic. However, last week TechCrunch came to the same conclusion: +1 Is Like A Button You Push For A Treat — Without The Treat. Some highlights:

I understand the concept behind the +1 Button — it’s a smart one. You get people to click it and it improves the page’s search ranking for logged-in Google users with social connections (and eventually maybe all results). At least I think that’s how it works. But I have a hard time believing that all of you actually clicking on the button really get why you’re doing it. Don’t get me wrong, it’s great that you’re clicking on it! I am too on some of our stories. But I can’t help but get the feeling that it’s a bit like a cruel experiment we’re running. We put up a button, you click on it because it’s there, expecting you’ll get a treat. But there is no treat.

As I was saying a few months ago, +1 allows for explicit signaling. But that signaling just isn’t a relevance feedback-type of signaling. The person doing the clicking doesn’t actually get anything “fed back” from that action to their ongoing information seeking task. TechCrunch continues:

If the +1 Button is serving me up better results, I’m just not seeing it. And yes, I know the button push also populates your Google profile with a feed of our shared stories. But let’s be honest, no one is looking at those. We’re definitely not seeing any noticeable bump in pageviews coming from Google as a result of the button. Maybe that will slowly change over time, but I’m not convinced. The rate at which people are clicking on the button appears to be dropping each day. And soon it may be just like the *gulp* Buzz button.

This echoes what I said in my previous post:

In traditional feedback, an individual user marks a subset of documents as relevant and non-relevant, and then the system updates his or her ranked list results, immediately, so as to increase the recall (and sometimes also the precision) of documents not yet seen. There is a reason it is called feedback: the loop is closed. Just like when you hold a microphone too close to a speaker and start to get audio feedback. That’s only possible because the output of one input gets fed immediately back into that same input. Not into someone else’s input.
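The closed loop described in that passage can be sketched in a few lines. What follows is only an illustration, using a Rocchio-style query update (a textbook relevance feedback technique that the post itself does not name). The toy documents, the `vectorize` helper, and the weights `alpha`, `beta`, and `gamma` are all assumptions made for the example.

```python
from collections import Counter


def vectorize(text):
    """Toy bag-of-words vector: term -> count (illustrative only)."""
    return Counter(text.lower().split())


def rocchio_update(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward documents marked relevant and away from
    documents marked non-relevant. The weights are assumed values."""
    updated = Counter()
    for term, weight in query.items():
        updated[term] += alpha * weight
    for doc in relevant:
        for term, weight in doc.items():
            updated[term] += beta * weight / len(relevant)
    for doc in non_relevant:
        for term, weight in doc.items():
            updated[term] -= gamma * weight / len(non_relevant)
    # Terms whose weight ends up zero or negative are dropped.
    return {term: w for term, w in updated.items() if w > 0}


def rank(query, docs):
    """Order document ids by dot-product score against the query, best first."""
    def score(doc_id):
        return sum(w * docs[doc_id].get(term, 0) for term, w in query.items())
    return sorted(docs, key=score, reverse=True)


# Documents the user has already looked at and judged.
seen = {
    "cars": vectorize("jaguar car dealer prices"),
    "cats": vectorize("jaguar cat habitat rainforest"),
}
# Documents the user has not seen yet.
unseen = {
    "financing": vectorize("car dealer financing"),
    "conservation": vectorize("big cat rainforest conservation"),
}

query = vectorize("jaguar")
before = rank(query, unseen)
# The loop closes here: the judgments feed straight back into the
# same user's query, and that user's unseen results are re-ranked.
query = rocchio_update(query, relevant=[seen["cats"]], non_relevant=[seen["cars"]])
after = rank(query, unseen)
```

In this toy run the two unseen documents are tied before the feedback, and the rainforest document moves to the top afterwards. The part that matters for the argument is the last three lines: the person who made the judgments is the person whose list changes.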

TechCrunch concludes:

Google needs to figure this out quickly. When you push a button, you need to get a treat. People will click for a while out of pure novelty and curiousness. But that only lasts so long. Without anything noticeable happening (like a share on Twitter, or a comment on Facebook), people will just ignore the button altogether. All over the web.

What has me scratching my head is why so many web search engines (and this +1 is just one example of the larger, industry-wide attitude) are so opposed to explicit relevance feedback. Yeah, I know the story: Altavista or Lycos tried some version of an explicit relevance feedback +1 button for a few months back in 1996 and it was found to not work well, because users were unwilling or too lazy to put any effort into using the tool. Well, with +1 and +1-type functionality, we’ve seen that users are indeed willing to put the effort into using the tool, at least until they find out that the tool isn’t really doing anything for them. So why not close the loop now, and quickly, before users build a strong association in their minds that +1-type buttons do nothing for the user, especially in the moment? An association that takes another 15 years to correct. Give the users a treat when they press the button. How? Close the loop of relevance feedback. This is an opportunity, not a criticism.

Miffed and Confused

jeremy — Wed, 15 Dec 2010 14:16:52 +0000

Have been on a six month blogging hiatus, and wouldn’t you know it.. it took another fun Google article to pull me back. It is a recent FastCompany piece, entitled Google to Zuckerberg, Bing: We Still Innovate. The premise of the article is that Facebook has recently partnered with Bing to deliver social search and cites Google’s slowed rate of innovation as one of the primary motivators for this move. This has left Google, one source says, “miffed and confused as to how Zuckerberg figured they weren’t innovating”.

Perhaps I could be of assistance.

The article cites a number of reasons why Google is miffed and confused about Facebook’s stance: “The company has more people working on search than ever before”. It has a “list of 100 [search] projects”. And “last year, the team launched about 550 changes to its search engine, and in September, unveiled Instant, one of the largest overhauls to its engine ever.” The article continues:

Every year, he says, Google runs thousands of experiments. These experiments include just about everything you could imagine: changing the color of a link or button; improving Arabic semantics; building real-time search; or creating Google Instant, the results-as-you-type feature. As Singhal says, it’s simply a function of having more resources: His team is able to test hypotheses faster. “We couldn’t do these things five years back even if we wanted to,” he explains. “We didn’t have enough engineers. But by having a bigger team today, we have new ideas, new people, and the capacity to execute on those ideas.”

So why would Facebook say that Google isn’t innovating, especially when 550 changes have been made? I think the next quote, by Google Fellow Amit Singhal, illustrates the problem perfectly:

Fast Company broached the subject recently with Google’s Amit Singhal, who oversees Google’s ranking and algorithm team. “The main reason why Google is where it is today is that we have been able to make huge changes to our search–and are able to do it while running this big search engine,” he says “We used to compare that to changing an engine on a jet while it’s flying. Over the years, we’ve not only mastered changing the engine while flying, but have been able to change the seats without the users noticing. That’s the beauty of how we innovate. You’ve suddenly given everyone first class seats, and they didn’t even wake up.”

The first problem is that because (most of) those 550 changes happen while the users are still “asleep”, users don’t actually notice them. Google doesn’t exactly go out of its way to make many of its search improvements visible to the user, and so it’s often difficult to tell whether or not something has happened. As a user, I personally don’t like that approach, because a change that is invisible or purposely hidden is a change that I as a user have no control over, and am not able to change back or alter further. And as I argued in an earlier post, the way to create passionate search users is not to give them luxury seats without waking them up. Instead, the way to create passionate search users is to give them search tools that offer a path along which they can grow, improve, and get better at searching. Do users get better at flying, or at seeing and comprehending an information landscape from 30,000 feet, if they’ve got luxury chairs? Arguably not. If anything, the luxury chairs make it harder for users to sit upright, to have a “leaning forward”, engaged experience. Users are less inclined, pun intended, to be active participants in the experience. All the decisions are being made for them.

But let’s set the user perception issue aside for a moment. Even if the user doesn’t notice those 550 improvements at a conscious level, that doesn’t change the fact that Google has innovated 550 times over the past year, does it? Of course not. The innovations have still happened, they exist. But what innovations are they? Well, as Singhal’s airplane analogy suggests, they are improvements that make the existing experience faster and a little more comfortable. Cushier seats. A better shade of link blue. More legroom. 5 pixel margins rather than 2 pixel margins. A faster plane with a more powerful engine. Google Instant.

But at the end of the day, it’s still a plane. And the view of the information landscape is still from 30,000 feet, even if that view is an Instant view. What if instead of getting the high level overview of relevant information, the user wants to dig down into a narrow, deep, richer vein? What if the user wants to mine for precious information ores, rather than fly over the mountain five miles overhead? What if the user wants the information retrieval engine to act as an excavator, deep earth drill, or other such heavy mining tool? Do any of those 550 changes help make the airplane more like an underground drill? Or do all 550 changes simply make a sleeker, faster airplane?

I’ve talked about this issue of evolutionary vs. long term thinking (improving the airplane, versus changing it into a deep earth drill) in the past. I’ve also asked for search to change radically, to help me in much harder information seeking tasks such as finding hidden cafes in Prague. But I think this question of innovation is illustrated perfectly in the following bit from the article:

But what about social search? Facebook teamed with Microsoft, not Google. Does Google have any partnerships planned for social search? “I’m glad you asked, because we launched social search about two years or so back,” says Singhal.

When Google launched social search 13.5 months ago (October 26, 2009 to December 9, 2010 is not two years), what they launched was this:

A lot of people write about New York, so if I do a search for [new york] on Google, my best friend’s New York blog probably isn’t going to show up on the first page of my results. Probably what I’ll find are some well-known and official sites. We’ve taken steps to improve the relevance of our search results with personalization, but today’s launch takes that one step further. With Social Search, Google finds relevant public content from your friends and contacts and highlights it for you at the bottom of your search results. When I do a simple query for [new york], Google Social Search includes my friend’s blog on the results page under the heading “Results from people in your social circle for New York.” I can also filter my results to see only content from my social circle by clicking “Show options” on the results page and clicking “Social.”

Having worked in the area of Collaborative Search (see also this) for the past four years, an area that is not unrelated to Social Search, I long ago learned to make the following distinction: There is a difference between process and data. Data-based social search is the idea of having content generated by your social circle show up in your results, e.g. your friend’s NY blog. Process-based social search is the idea of using your friends’ patterns of information seeking behavior to influence the ranking of content from outside of your social circle.

Another way of expressing this distinction is “search of social data” versus “social search of data”.

Showing your friend’s blog when you search for New York is search of social data. It’s interesting, but I wouldn’t necessarily characterize it as an innovative, game changing leap. Social search of data, on the other hand, is a much more radical approach, and much more of a leap. It affects how one finds every piece of information on the entire web, not just your friends’ blogs. I started seeing the concept of social search of data appear 4 to 5 years ago, with the work of Barry Smyth, and Microsoft publicly started publishing work in this area around three years ago. So it does make sense to me that Facebook would partner with a company that has more of a track record in social search of data, rather than search of social data.
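To make the distinction concrete, here is a small sketch that contrasts the two ideas. It does not describe how Google, Bing, or Facebook actually do any of this. The result records, the example URLs, the `friend_clicks` counts, and the additive boost are all hypothetical, chosen only to show that one approach filters by who wrote the content while the other re-ranks everything by how friends behaved.

```python
def search_of_social_data(results, friends):
    """Data-based: surface only content authored by people in your circle."""
    return [r for r in results if r["author"] in friends]


def social_search_of_data(results, friend_clicks, weight=1.0):
    """Process-based: re-rank ALL results, whoever wrote them, using
    friends' click behavior as an extra signal (assumed additive boost)."""
    total = sum(friend_clicks.values()) or 1

    def boosted(result):
        share = friend_clicks.get(result["url"], 0) / total
        return result["score"] + weight * share

    return sorted(results, key=boosted, reverse=True)


# Hypothetical engine output for the query [new york].
results = [
    {"url": "official.example/nyc", "author": "city", "score": 0.9},
    {"url": "friendblog.example/ny", "author": "alice", "score": 0.2},
    {"url": "hiddengems.example/ny", "author": "stranger", "score": 0.5},
]
friends = {"alice"}
# How often people in the circle clicked each result for similar queries.
friend_clicks = {"hiddengems.example/ny": 8, "official.example/nyc": 2}

data_based = [r["url"] for r in search_of_social_data(results, friends)]
process_based = [r["url"] for r in social_search_of_data(results, friend_clicks)]
```

With these made-up numbers, the data-based version returns only the friend's blog, while the process-based version promotes a page written by a stranger because friends found it useful. That second behavior is the one that touches the whole web.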

Don’t get me wrong; I am not saying Google doesn’t innovate. It does. I am simply trying to explain to those who were miffed and confused why someone would say that. A better shade of link blue makes the search process more comfortable; it plushes up the airplane seats. An entire engine rebuild so as to allow instant results makes things faster. But it doesn’t fundamentally alter the manner in which information is found. It doesn’t utilize social behaviors to rerank the entire web. It doesn’t let me dig deeper, and find hidden cafes in Prague. It just lets me not find those same hidden Prague cafes…faster and more comfortably.

If the engine rebuild around Google Instant was one of the “largest overhauls to its engine ever” as the article says above, and it only quantitatively changes the speed at which results come back rather than qualitatively changing the manner in which information seeking happens, then it is not unreasonable to seek a different type of innovation. At some point search has to become more than precision@3, more than a fast, comfortable ride. At some point search has to become a real tool for exploration and growth, for comparison and learning. At some point, the definition of innovation has to move from step to leap. Google has the engineering chops to make this happen. But do they have the culture?

Simplicity: Sparsity or Storytelling?

jeremy — Thu, 10 Jun 2010 17:39:00 +0000

A tweet by @akumar prompted me to punch up this quick blogpost:

as with all controversial issues, there’s a positive in google trying bing/image – that they’re not afraid to learn from competition

What Amit is referring to is the recent addition of gorgeous photographic images as search page background. See for example this writeup: http://blogs.abcnews.com/theworldnewser/2010/06/google-vs-bing-copycat-picture-on-prominent-page.html

He is of course correct; Google is learning from the competition. But there is another issue at play here, one that I don’t want to overlook because I feel it is very important. It is the issue of simplicity. What is simplicity? How is it defined? How is it measured? Conversely, what is complexity? What is clutter?

For over a decade now, Google has essentially defined simplicity as sparsity. Sparse backgrounds, lots of negative space, sparse color schemes, sparse auxiliary information (e.g. query term suggestions on the SERP page have only started appearing in the last year or two, despite the fact that such features existed 15 years ago in search engines of old such as Infoseek and Altavista). The reason given was that people didn’t like clutter, that people like simplicity. And in Google’s definition, simplicity equals sparsity.

I agree. People do like simplicity. I don’t question the veracity of that general sentiment. What has always bothered me, though, is the equating of simplicity with sparsity. I think a much better definition of simplicity is not the amount of information or colors or negative space on a page, but the story that a design, interface, interaction, or algorithm tells. Something with a lot of colors and links and words can still be simple…if it tells a clear story! Conversely, something with fewer colors and links (sparser) can be more complex, if the story that it communicates is muddy and not as purposely focused.

This brings us to the Bing background image. In my opinion, even though the inclusion of a background image is less sparse and more “cluttered” (more colors, more shapes, more textures), it actually assists in the telling of a clearer story. Why? Because it more cleanly separates foreground and background, subject and frame. It provides compositional balance to the page. The white query input box on white background (10+ years of Google design) is sparser, but the story that it tells is less clear because foreground and background are not as cleanly separated. A white query input box on a richly colored and textured background tells a clearer, simpler story because the background image frames and separates the foreground query input box. Furthermore, because you can now distinguish background and foreground, you can more clearly see that the query input box lies near the pleasing “rule of thirds” line, which aids further in the overall storytelling.

In short, I applaud this move by Google, just as I applaud it from Bing. I never liked the white-on-white, because sparsity is not the same thing as simplicity. Simplicity arises through good storytelling, not through minimalism. No A/B testing will tell you this, though. It’s a definitional issue that must be settled before you start your A/B tests. Google has learned from the competition, as @akumar says. But I hope that the lesson Google has learned is not just that users like pretty pictures. I hope the lesson is that, when it comes to simplicity, there is a difference between sparsity and storytelling.

See also my posts: The Tyranny of Simplicity, The Tyranny of Simplicity, Redux, and The Craft of Storytelling. I also found this older discussion on Google’s Lively to be a fascinating read. In my understanding, the issue of “necessary complexity” that the author of that post hammers home is related to the issue of storytelling. Too much sparsity (of interaction in Lively’s case) leads to an inability to tell a clear story. Simplicity is storytelling, not sparsity.

Embark Together

jeremy — Mon, 15 Mar 2010 20:52:12 +0000

I would like to quickly follow up on my previous post on explicitly collaborative information seeking. My claim in that post was that, despite the shared terminology, a service like Aardvark (or Twitter) is not truly collaborative.

Let me be clear about Aardvark: What that service does is help you comb through a network of people to find those individuals who have the highest likelihood of holding the answer to your information need. Somebody has the answer; you just don’t know who it is. So Aardvark helps you find that somebody. The reason this is different from what I am talking about with explicit collaboration is that in this latter case, you already know who it is that you want to work with on resolving a shared information need. You want to work with a relationship partner on finding an apartment. You want to work with a business colleague on finding potential markets for a new product. You want to work with some buddies on planning a road trip. In all of these situations, your partner, your colleague, and your buddies don’t already have the answers that you seek. But you do know that you want to work with them to find those answers because they have the same need that you do. Your partner wants to live with you, your business colleague wants to work with you, and your buddies want to travel with you. This is what explicitly collaborative information seeking is about, and it’s not the same thing as the “collaborative” category discussed in the panel.

Case in point: Take a look at the panel’s slides: http://www.slideshare.net/bmevans/introductory-slides. Slide 9 outlines the two main social strategies: (1) Ask the network, and (2) embark alone. This misses a third major, but as yet untapped, strategy: (3) embark together.

A good way to think about this is in terms of information seeking. In both the (1) ask the network and (2) embark alone strategies, there is only a single user with an actual information need, a single person who is actively seeking information. Using Aardvark, he or she is asking other people in the network if they are able to give an answer to satisfy that need. But those other individuals do not actively share your information need. They already either (1) have the information that you seek, and thus already have a satisfied information need, or (2) do not have the information you seek, but do not care, i.e. they do not share your information need (they aren’t going to move in with you, or go on that road trip with you). When you ask the network, you are not actually involved in collaborative information seeking. There is only a single seeker: You. You are simply tapping into the network to find those people who already have the information you need. It is still the single individual, not the network, that has the information need and that is actively engaged in the seeking process.

But embarking together with one or two other individuals who also lack information, i.e. engaging in explicitly collaborative information seeking, is an entirely different process. In this case, there are at least two information seekers, two people who have a shared, as-yet-unsatisfied, information need. Now, there are a number of different ways you can build systems and design interfaces to support these multiple seekers in their task. I’ve written a lot about such systems on this blog and on the FXPAL blog, and will not go into it in further detail right now. The point is simply that embarking together is an information seeking strategy that was not covered by any of the existing methods. It is not the same as asking the network. It is not the same as embarking alone. It is a third process, a third strategy, and one that remains quite untapped in today’s marketplace.

Update: I have a final quick example. On his live blog, Danny Sullivan paraphrases Max from Aardvark: “We want to do that across communication channels, so you can find partners to go bike with”. That’s Aardvarkian social search: You want to find the people to go biking with. Collaborative search is the next phase. Once you’ve found the people that you explicitly know that you want to go biking with, how do you find out where you want to go? You know about all the bike trails around your house. Your new biking partner knows about all the trails near her house. But neither of you knows about the trails that exist halfway between both of your houses. Ideally, you’d like to find one of those trails that is good for both of you, because neither of you is aware of them. (And why should you have been? Before meeting your partner, you had no reason to venture away from your favorite nearby trails.) THAT is explicitly collaborative information seeking. When both of you actively look for new bike trails, that is embarking together.

A Fragile Local Maximum for the Web

jeremy — Thu, 24 Dec 2009 00:23:10 +0000

On Twitter today, Josh Young made an interesting observation to which I would like to call attention:

Ya, @jerepick, with “fauxpen” attached, google’s “nav. search as the top of the stack” is a fragile local maximum for the web.

This observation is a followup to the web-wide discussion that Google kicked off about the meaning of open. Essentially, Rosenberg says that all of Google’s products that are not at the search layer of the stack should work toward being open, but that the search layer itself should be closed. To protect it from spammers, you understand {cough}.

Earlier in the same post Rosenberg makes a distinction between open source and open data, calling for increased openness in both. However, when it comes to defending closed-search, this distinction gets lost. But this distinction between open source vs. open data is important. Here is how it translates to the search domain:

Open Source = Open search algorithm is about letting the world know what features are used to rank pages and how those features interrelate (are weighted)
Open Data = Open search results is about letting users refactor, remix, reuse, mashup, store and re-search locally any and all query results that the user issues. And about letting the user use any software that they want to accomplish this — not just Google software

The excuse given about why Google cannot open up is that spammers would be able to game the engine. But if we look closely, we’ll see that it is an excuse that is primarily, if not exclusively, related to the “open source” aspect of openness. Black hat SEO algorithmic gaming is not an issue when it comes to user results re-use and remixing.

And so the point (I think) Josh is making is that by closing not only the algorithm, but also the results of that algorithm, Google has effectively declared a moratorium on Internet application stack progress along that vertical. Google is essentially saying to the Internet: “You shall not pass. If anyone wants to develop an application that makes use of search results as a ‘lower’ stack layer, that person will have to write an entire search engine themselves. We are in favor of any layer underneath us, or parallel to us such as gmail, growing the internet pie, but we will not directly participate in growing the pie ourselves by opening up our results so as to allow search itself to become a middling layer in someone else’s stack.”

This does not sit well with me. A search engine by nature is built on the stack layer of web page content, which is built on the stack layer of internet and transmission control protocols, and so on. To say that users have no right to use whatever software they choose to build further on this search engine layer denies users the same basic open rights that people like Lawrence Lessig so passionately fight for. In fact, some have argued that these rights are not even Google’s to grant, that fair use lets us re-use the search results that we, by our querying effort, had a hand in creating. So in the spirit of openness Google should fully open up its results to programmatic, API access. And allow users to remix and reuse, to metasearch and to share (e.g. social search and collaborative search). That grows the pie, does it not?

I do understand why Google will not do this. It’s because by so doing, Google would effectively allow itself to be disintermediated, the same way they are currently disintermediating the newspaper industry. By decoupling results from ads (where money is made), it makes it much more difficult for Google to monetize its traffic — a problem that all disintermediated layers of the stack face. Naturally Google doesn’t want to put itself in this position. But that is what makes their current stance on “open” all the more perplexing — they expect others (e.g. newspapers) to open up their revenue-stream stack layers, but refuse to do so themselves. Why take such a strong position on openness and then give an unrelated (spammer) excuse about why you cannot be?

So why am I writing about this? Again, let’s go back to something Josh just retweeted:

RT @jonathanglick The Open debate matters because, right now, for the first time in a decade, the forces of Closed are on the march.

This, combined with Google’s open call for earnest web-wide discussion and debate, has increased my desire to add to the conversation. And my point here is that spammers are not the issue when it comes to making Google search “open”. You can open the data without opening the algorithm. If anyone has pointers to refutations of this line of reasoning, I welcome them.

Google and the Meaning of Open

jeremy — Tue, 22 Dec 2009 11:44:43 +0000

There is a fantastic Google blog post today by Jonathan Rosenberg on the meaning (and value) of openness. Whooo-boy.. where do we start with this can of worms? Guess I’ll jump right in. Warning: This is probably the longest post I’ve written, so if you are easily bored, understand that this is not required reading. It will not be on the test.

Here we go:

At Google we believe that open systems win. They lead to more innovation, value, and freedom of choice for consumers, and a vibrant, profitable, and competitive ecosystem for businesses.

Agreed! I’m fully on board with the spirit of this opening statement!

Many companies will claim roughly the same thing since they know that declaring themselves to be open is both good for their brand and completely without risk.

True. So the question arises: What happens when being open carries with it an amount of risk? Do you open up those areas of your business as well? Or do you forever keep your most valuable layer of the stack closed and proprietary, both in terms of closed source as well as not-fully-open information?

We run the company and make our product decisions based on these principles, so I encourage you to carefully read, review, and debate them. Then own them and try to incorporate them into your work. This is a complex subject and if there is debate (and I’m sure there will be) it should be in the open! Please feel free to comment.

I like the spirit of this discussion so far. I earnestly believe that Google is debating these things internally. But I also take them at their word that they would like this debate to be in the open. Consider this blog post part of my ongoing comment, and ongoing engagement in what I consider to be an extremely important area: The organization and dissemination of information.

There are two components to our definition of open: open technology and open information. Open technology includes open source, meaning we release and actively support code that helps grow the Internet, and open standards, meaning we adhere to accepted standards and, if none exist, work to create standards that improve the entire Internet (and not just benefit Google). Open information means that when we have information about users we use it to provide something that is valuable to them, we are transparent about what information we have about them, and we give them ultimate control over their information.

Ok, first question: Why does open information only include the information that you collect about users? Why does it not include information about what you are trying to do for users, and how, and why? Why does it not include what metrics you are optimizing, so that the user can understand what the search ranking functions are trying to do for him or her? For example, is Google open enough to enable me to know how much diversity is intentionally being injected into my search results? Will they allow me to know to what extent my particular query is being optimized for precision or for recall? Will they allow me to know what factors went into the decision to rank page X higher than page Y, and can I change those factors, so that I am able to instruct the search engine to give me different sets of results for the same exact query term, so as to maximize my value for my own understanding of my own particular information need? Never mind whether or not most users would even want to do something like this. Most probably do not, but many (more than 5%, I have little doubt) do. Is Google open (transparent) enough, in principle, to ever allow a user to make use of the service in this manner? Or is this something that will forever be hidden from the user’s view?

More importantly: Why is the information shown to the user not symmetric? Google stores (and uses) more about the user than the user is allowed to store (and use) about Google. Let me give a concrete example: When I run a query, Google knows a number of pieces of information that arise out of my information-generating actions. It knows:

(1) The text of the query itself, along with the timestamp
(2) All the results that are mutually generated from the intersection of the query (my intellectual product) and the algorithm (Google’s intellectual product)
(3) The link(s) that I clicked as a result of that query
(4) The link(s) that I didn’t click as a result of that query

Last I checked, Google allows me to access and export (1) and (3). But they do not allow me to access and export (2) and (4). It’s not that I didn’t interact with those results, or have a shared hand in their creation. I created the query in the first place. And then for a number of results, I viewed them and explicitly made a choice (judgment) not to visit certain pages. Just because there was no click does not mean that there isn’t any information. This idea is counterintuitive at first, but it’s true. There was an interface, an option about whether or not to click, and a decision not to click. I created that information during my search session. So why can I not export that information? Why can I not get a list of all the results that I viewed, that I decided not to click?

If this information were open, especially through an API, I could start to do all sorts of interesting things with it. I could keep track of whether or not there were certain pages that kept coming up in the results, over and over. And maybe that would allow me to reevaluate my initial non-relevance decision and look deeper into a piece of information that had initially not appeared fruitful (not had much of an information scent associated with it). Being able to keep algorithmic track, via an API, of all the pages I didn’t click would also allow me to compare and contrast relevance and non-relevance information on Google with other search engines such as Yahoo! and Bing. It would allow an ecosystem of service providers to grow up around Google and provide me with software solutions that allowed me to keep tabs on all the search engines simultaneously and understand the relative differences between and relative merits of each. In other words, being able to API-download both my clicks and my non-clicks would allow me to metasearch Google! (And there is a long history of academic literature on the value of metasearch; I don’t need to go into it here.)
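To make the "pages that kept coming up" idea concrete, here is a minimal sketch. It assumes a hypothetical export in which each query record lists the results shown and the results clicked. No such export exists as far as I know; the record layout and the function name `recurring_non_clicks` are my own illustration, not any search engine's API.

```python
from collections import Counter

# Hypothetical export format (an assumption, not a real API):
# one record per query, with the URLs shown and the subset that was clicked.
sessions = [
    {"query": "exploratory search", "shown": ["a.com", "b.com", "c.com"], "clicked": ["a.com"]},
    {"query": "exploratory search tools", "shown": ["b.com", "d.com"], "clicked": ["d.com"]},
    {"query": "information seeking", "shown": ["b.com", "c.com", "e.com"], "clicked": []},
]

def recurring_non_clicks(sessions, min_count=2):
    """Return URLs shown but not clicked in at least min_count query records."""
    counts = Counter()
    for record in sessions:
        clicked = set(record["clicked"])
        # Count each URL at most once per query record.
        for url in set(record["shown"]) - clicked:
            counts[url] += 1
    return sorted(url for url, n in counts.items() if n >= min_count)

print(recurring_non_clicks(sessions))
```

A page that keeps surfacing without ever being clicked is exactly the candidate for a second look at an initial non-relevance decision.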

I am not talking about some third party company scraping Google’s results and displaying them on their own website for their own purposes. I’m talking about being able to use software (that I’ve licensed from a third party) to do it myself, for myself. For more on this, see Phil Windley’s posts, It’s My Browser and I’ll Auto-Click If I Want To and Claiming My Right to a Purpose-Centric Web and Jon Udell’s post Magic Glasses and Magic Projectors: Private Versus Public Augmentation of Experience (see also the comments/discussion in this latter post). As Windley writes, the issue is this: Do people have the right to control how Web content is displayed in their browser? Openness of search data and information, including all the data associated with that mutually-interactive process (both clicked and non-clicked results) is the cornerstone of transparency and end-user value. In order to claim openness, and to establish end-user value, I feel that there has to be a complete symmetry between everything that Google stores (and makes use of, internally) about the user, and everything that the user is allowed to store (and make use of, internally) about Google.

So is there a symmetry? No. Read the Terms of Service:

The Google Services are made available for your personal, non-commercial use only…{snip}…You may not take the results from a Google search and reformat and display them, or mirror the Google home page or results pages on your Web site. You may not “meta-search” Google…{snip}…You may not send automated queries of any sort to Google’s system without express permission in advance from Google. Note that “sending automated queries” includes, among other things: using any software which sends queries to Google to determine how a website or webpage “ranks” on Google for various queries; “meta-searching” Google; and performing “offline” searches on Google. Please do not write to Google to request permission to “meta-search” Google for a research project, as such requests will not be granted.

Compare that to what was being said above:

Open information means that when we have information about users we use it to provide something that is valuable to them, we are transparent about what information we have about them, and we give them ultimate control over their information.

To me, that does not sound like I have ultimate control over my information…my queries, my clicks, and my non-clicks. If I had ultimate control over those, then I could take those clicked and non-clicked links and use them in any way that I wanted. I could re-search those links, offline, without having to return to Google (value to the user: this response time would be much quicker, and internet connectivity would not be required — privacy could also be enhanced!) I could mash those links up with other search results, from both Google and non-Google sources (metasearch; value to the user: better results, more diverse results, more complete results). As Google says:

Another way to look at the difference between open and closed systems is that open systems allow innovation at all levels — from the operating system to the application layer — not just at the top. This means that one company doesn’t have to depend on another’s benevolence to ship a product.

That is, if I want to install software that lets me metasearch Google, the creator of that software should not have to depend on Google’s benevolence in order to be able to create their product. Openness at the system level is what allows a third-party company to develop the software, and Google-to-user openness on the data level is what lets me, the user, put my data into that software to mash up all of my queries, clicks, and non-clicks, essentially metasearching Google. This is exactly the sort of scenario that openness is meant to address.
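For readers who have not seen metasearch in practice, the core of it can be quite small. The sketch below uses reciprocal rank fusion, one well-known way of merging ranked lists; I am not claiming it is what any particular metasearch product uses. The result lists and engine names are made-up illustrations, and the constant `k=60` is the value commonly used in the literature.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists; each URL scores sum(1 / (k + rank))."""
    scores = defaultdict(float)
    for results in ranked_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / (k + rank)
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda url: (-scores[url], url))

# Illustrative, invented result lists for the same query on three engines.
google = ["a.com", "b.com", "c.com"]
bing = ["b.com", "d.com", "a.com"]
yahoo = ["b.com", "a.com", "e.com"]

fused = reciprocal_rank_fusion([google, bing, yahoo])
print(fused)
```

Pages that several engines agree on rise to the top, while pages only one engine returns still appear further down, which is where the "more diverse, more complete" value of metasearch comes from.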

Continuing on:

If we can embody a consistent commitment to open — which I believe we can — then we have a big opportunity to lead by example and encourage other companies and industries to adopt the same commitment. If they do, the world will be a better place.

And:

If they use our products and store content with us, it’s their content, not ours. They should be able to export it or delete it at any time, at no cost, and as easily as possible.

Again, I agree. When all search engines, Google included, allow me to store (and reuse!) my search interaction data (all queries, clicks, and non-clicks) then this data truly becomes valuable to me, the user. Yahoo! has taken much more of a lead in this area with SearchMonkey. Not only do I not see the same openness from Google, I see Terms of Service that actively discriminate against these sorts of applications and usage of one’s own data.

Open systems are just the opposite. They are competitive and far more dynamic. In an open system, a competitive advantage doesn’t derive from locking in customers, but rather from understanding the fast-moving system better than anyone else and using that knowledge to generate better, more innovative products. The successful company in an open system is both a fast innovator and a thought leader; the brand value of thought leadership attracts customers and then fast innovation keeps them. This isn’t easy — far from it — but fast companies have nothing to fear, and when they are successful they can generate great shareholder value. Open systems have the potential to spawn industries. They harness the intellect of the general population and spur businesses to compete, innovate, and win based on the merits of their products and not just the brilliance of their business tactics.

Make no mistake: I greatly admire the ideals being expressed here. I cannot stop agreeing. I just have a hard time reconciling this with what is written later on in the same post:

While we are committed to opening the code for our developer tools, not all Google products are open source. Our goal is to keep the Internet open, which promotes choice and competition and keeps users and developers from getting locked in. In many cases, most notably our search and ads products, opening up the code would not contribute to these goals and would actually hurt users. The search and advertising markets are already highly competitive with very low switching costs, so users and advertisers already have plenty of choice and are not locked in.

By only making part of one’s information available (queries and clicks) and not the other part (seen but non-clicked results that were created by the user+Google collaborative query+algorithm search session, as well as unseen and non-clicked results from that same session), it makes it much more difficult for a user to migrate to another service. For example, what if Bing were to start offering full-time, default personalization the same way Google now does? Google undoubtedly trains one’s personalized algorithm using all of one’s own information: queries, clicks and non-clicks. Abandoned searches (and knowledge of exactly which results were not clicked) are just as much an important part of the overall algorithmic mixture, are they not? They convey important information. So suppose Bing now wanted to allow you, the user, to jump headfirst into personalized Bing results. The best way to do that would be to upload your Google search data to Bing, so that Bing could start personalizing based on this same years-long history. Google does not allow this. That information is not available for you to use, in any third-party application…whether metasearch or Bing-provided personalized search. Saying that a user is only one click away from another search engine masks the full truth. If the first search engine has learned something about a user and can therefore provide different, personalized results for that user based on months or years of history, then that search engine is stickier than others. In order to be truly open, a user has to be able to “export” his or her profile, both the clicks and the non-clicks, and take that information to another search engine, or else there will be lock-in.

The Google post continues:

Not to mention the fact that opening up these systems would allow people to “game” our algorithms to manipulate search and ads quality rankings, reducing our quality for everyone.

Ok, here is where I have to take another strong, contrary stance. Google just got finished saying the following, earlier in the post:

Open systems are just the opposite. They are competitive and far more dynamic. In an open system, a competitive advantage doesn’t derive from locking in customers, but rather from understanding the fast-moving system better than anyone else and using that knowledge to generate better, more innovative products.

So let me get this straight: Google is pro-openness in all the areas where they don’t make any money, where openness doesn’t actually affect the bottom line. But when it comes to the moneymaker, that has to be closed and proprietary to protect it from spammers? But I thought that open systems are just the opposite. Don’t competition and dynamism create an environment in which everyone can work on the problem of fighting spam, thereby coming up with a better solution and a quicker solution than a closed environment can produce? Isn’t one of the maxims of openness the idea that the smartness of the competitive environment as a whole can far outproduce any one company? Isn’t that the whole idea behind the Netflix Prize (for example): that as everyone shares their solutions with each other, everyone gets better than any one team could have done on their own?

Opening up Google’s algorithms might allow people to “game” them for a short time. But in the long run, the benefits that come from openness would outpace the spammers’ ability to game. Right? Isn’t that the technological optimism that this blog post is expressing?

Our skills and our culture give us the opportunity and responsibility to prevent this from happening. We believe in the power of technology to deliver information. We believe in the power of information to do good. We believe that open is the only way for this to have the broadest impact for the most people. We are technology optimists who trust that the chaos of open benefits everyone. We will fight to promote it every chance we get. Open will win. It will win on the Internet and will then cascade across many walks of life: The future of government is transparency. The future of commerce is information symmetry. The future of culture is freedom. The future of science and medicine is collaboration. The future of entertainment is participation. Each of these futures depends on an open Internet.

Open will win. Open will beat the spammers, will it not? Hasn’t Google just expressed confidence that it will? Simultaneously, open will better serve the users. And open will allow third-parties to create exploratory- and recall-oriented and social and collaborative search systems that make use of Google algorithms and indices as just another layer in the overall information organization and dissemination stack. This grows the overall pie. Remember:

Another way to look at the difference between open and closed systems is that open systems allow innovation at all levels — from the operating system to the application layer — not just at the top. This means that one company doesn’t have to depend on another’s benevolence to ship a product.

Search is another layer in that stack, one that should be just as open as the other layers, and one that need not fear gaming, because openness will overcome. That’s the call-to-arms that I see expressed in this blog post. And it gets me excited. At the same time, I see an elephant in the room. And that is Google’s unwillingness to be open in the one (and mainly only) area where money is made. And what really bothers me about that is that it seems no different from any other technology company: All technology companies want openness in those layers of the stack where they don’t make money, and closedness where they do. Google does not seem any different.

Even Google’s excuse for not being open is very similar to others, like Microsoft’s. Microsoft, which makes the bulk of its money on Windows, says that Windows can’t be open-sourced because hackers will see all the bugs in the code and be able to more easily exploit the OS and write viruses. Companies like Google say “nonsense”, and point to open source OSes like Linux as an example of how openness can breed hardened code that is less hackable. And artists and record labels and newspapers also worry about bootleggers and content-copiers depriving them of their stack-layer income, despite absolutist assurances from Google that increased traffic guarantees monetization. So why does Google think that if we open-sourced search algorithms, the community would not be able to “harden” those algorithms against spammers, and simultaneously guarantee Google’s income? There may be a little chaos at first, but the chaos of open benefits everyone.

Toward the end, Google wraps up:

All of this is useless, however, if we fail when it comes to being open. So we need to constantly push ourselves. Are we contributing to open standards that better the industry? What’s stopping us from open sourcing our code? Are we giving our users value, transparency, and control? Open up as much as you can as often as you can, and if anyone questions whether this is a good approach, explain to them why it’s not just a good approach, but the best approach. It is an approach that will transform business and commerce in this still young century, and when we are successful we will effectively re-write the MBA curriculum for the next several decades!

Make no mistake, I am on board with the stated goals. However, in order to rewrite that MBA curriculum, I need to see Google be just as open at their moneymaking core stack layer (search and ads) as they want everyone else to be with operating systems, networks, and intellectual property (books, music, news articles, etc.). It is just a little too convenient to have an excuse about why one’s own layer cannot be opened up (spammers), while insisting that everyone else’s layers can, hackers and bootleggers be damned.

I feel very strongly about all this, but that does not mean that I am correct. Rather, I am taking Google seriously at its word: “I encourage you to carefully read, review, and debate [these principles]. Then own them and try to incorporate them into your work. This is a complex subject and if there is debate (and I’m sure there will be) it should be in the open! Please feel free to comment.” Consider this blog post my first of many comments. Not for the purpose of tearing down, or argument for argument’s sake. But for the purpose of getting these challenges and questions and comments out in the open, exactly as requested, in order to further the same end goal. It would be fantastic if Google succeeded, and I agree with them: It’s not just a good approach, but the best approach. But in order to start transforming business and commerce, Google needs to set the example by being open in its core area (search and ads) and trusting that the chaos of openness will defeat spammers in addition to hackers and bootleggers. Right now, for all the openness in non-core business areas, and for all the talk, Google is unwilling to be open where it really matters.

Update: TechCrunch makes almost exactly the same points, without going into as much detail as I have above. Still, I think the details that I mention add an important layer to the overall discussion. First, I have pointed out how Google can be open about its end search results/data without having to open its algorithms (via allowing metasearch and via port-to-Bing personalized search). The problem is that doing so would disintermediate Google, and push them down from the top of the stack. Why? Because now users could build all sorts of applications on top of Google search results, instead of going through the ad-filled Google interface. So openness is a problem when it comes to making money. I also think there is value in pointing out that, were the algorithms themselves to be opened up, it’s not like search is the only industry that has to contend with miscreants. OS developers have to deal with hackers, content creators (musicians, authors) have to deal with bootleggers. So if you ask the OS to go open-source, and the musician to go DRM-free, what’s so different about asking the search engine to go open-algorithm?

Update 2: Also check out Chris Dixon’s post (http://cdixon.org/2009/12/22/google-should-open-source-what-actually-matters-their-search-ranking-algorithm/) and Danny Sullivan’s comments (http://cdixon.org/2009/12/22/google-should-open-source-what-actually-matters-their-search-ranking-algorithm/#comment-27024421), both of which are quite similar in spirit to where I am coming from on this matter. Also interesting is Harvard Business School Prof. Tom Eisenmann’s take (http://platformsandnetworks.blogspot.com/2009/12/googles-svp-product-management-jonathan.html). If discussion is what Google wants, discussion is what Google gets.

Loss Leaders versus Exploratory Search

jeremy — Tue, 15 Dec 2009 19:35:38 +0000

Chris Dixon had a post yesterday about search and the social graph. An interesting read, but what struck me the most was a tangent about how current search engines make money:

Lost amid this discussion, however, is that the links people tend to share on social networks – news, blog posts, videos – are in categories Google barely makes money on. (The same point also seems lost on Rupert Murdoch and news organizations who accuse Google of profiting off their misery).

Searches related to news, blog posts, funny videos, etc. are mostly a loss leaders for Google. Google’s real business is selling ads for plane tickets, dvd players, and malpractice lawyers. (I realize this might be depressing to some internet idealists, but it’s a reality). Online advertising revenue is directly correlated with finding users who have purchasing intent. Google’s true primary competitive threats are product-related sites, especially Amazon. As it gets harder to find a washing machine on Google, people will skip search and go directly to Amazon and other product-related sites.

I’ll repeat the salient bit: “Google’s real business is selling ads for plane tickets, dvd players, and malpractice lawyers.” What struck me about this statement was not its veracity. What struck me was its relationship to exploratory search. It is when searching for a plane ticket, purchasing an expensive consumer good, or hiring a decent lawyer that my need for exploratory search is at its highest.

So my question is whether or not there is a tension here between getting the users off of the results page as quickly as possible — especially when the route off that page is typically via an advertisement on which the search engine makes money — versus enabling the user to remain on the results page in a process-oriented mode of sorting and filtering and playing around with the results in a myriad of different ways, so as to come up with a set of options that best satisfies the exploratory need.

Do these two goals conflict? Why or why not? It is an old question, but I am still searching for a satisfactory answer.

Update: Perhaps I should have been more clear as to what characterizes an exploratory search session. There are dozens of papers out there that tell the story much better than I can, so I will quote one of them. It’s by Michael Levi at the U.S. Bureau of Labor Statistics, published at the Information Seeking Support Systems (ISSS) workshop in June 2008. Title of the paper is “Musings on Information Seeking Support Systems”. (See http://ils.unc.edu/ISSS/ISSS_final_report.pdf) I quote:

Some characteristics of open-ended, discovery-oriented exploration emerge:

1) I may not know, at the beginning, whether a seemingly straightforward line of inquiry will expand beyond recognition. Sometimes it will, sometimes it won’t. A lot depends on my mood at any given moment.

2) I can’t predict when the exploration will end. It may be when I’m satisfied that I have learned enough (which also would vary from day to day and query to query.) It may be when I get tired or bored. It may be when I’ve run out of time. Or it may be when I get distracted by dinner or the allure of the swimming pool.

3) I can’t determine, objectively, whether the exploration has been a success. There is usually no “right answer” against which I can measure my progress.

4) My exploration is not a linear process. I could get interested in a tangent at any time from which I may not return. I am also likely to backtrack, possibly with some regularity, either because a tangent proved unfulfilling and I want to resume my original quest, or because I thought of a new question (or a new way of formulating a previous question) to direct at a resource I visited previously.

5) I am likely to want to combine, compare, or contrast information from multiple sources. One of those sources is my memory – which may or may not be reliable in any given circumstance.

Levi then makes a number of recommendations about what an information seeking support system should do, to enable this sort of exploratory search:

A useful information seeking support system, then, would require the following minimum functionality:

1) It should not interfere with my behavior as listed under Characteristics of Exploration above.

2) It should give me capabilities at least as good as those listed under Manual Tools above.

3) It should positively assist my explorations by making them easier or faster or more comprehensive or less error-prone or…

In addition, an ISSS might give me capabilities that I never employed before because they were not possible or because I didn’t think of them.

But, to be truly a leap forwards, an ISSS would need to exhibit at least elements of discernment, judgment, subject matter expertise, and research savvy.

Again the question: Is there a tension here between getting the users off of the results page as quickly as possible — especially when the route off that page is typically via an advertisement on which the search engine makes money — versus enabling the user to remain on the results page in a process-oriented mode of sorting and filtering and playing around with the results in a myriad of different ways, so as to come up with a set of options that best satisfies the exploratory need?

I’ve already heard certain search engines state that their goal is to get the user off the search page as quickly as possible. That in and of itself tells me that they’re specifically designing the system so as to interfere with behaviors listed under Characteristics of Exploration above (Levi’s first recommendation). Why does it interfere? Because my goal is to stick around in the results and compare and contrast, whereas their goal is to get me off of the page as quickly as possible. And so the whole system is designed to do the opposite of what I want it to do.

Additionally, I was also pointing out that the information domains on which I usually have the largest exploratory-type information needs are very similar to the information domains on which the search engines make most, if not all, of their money. I’m still trying to figure out what to make of that.

Thoughts?

More Information Is Positive

jeremy — Mon, 09 Nov 2009 12:48:48 +0000

Via Greg Linden, I came across this interesting quote from Eric Schmidt about the obligation to help newspapers succeed:

Finally, Eric claimed Google has a moral duty to help newspapers succeed:

Google sees itself as trying to make the world a better place. And our values are that more information is positive — transparency. And the historic role of the press was to provide transparency, from Watergate on and so forth. So we really do have a moral responsibility to help solve this problem.

Well-funded, targeted professionally managed investigative journalism is a necessary precondition in my view to a functioning democracy … That’s what we worry about … There [must be] enough revenue that … the newspaper [can] fulfill its mission.

It is great that Google feels this professional responsibility. And I wholeheartedly agree with Schmidt that “more information is positive”. My only question is: Why don’t we see “more information” and transparency when it comes to other media companies, aka search engines? Newspapers engage in investigative journalism in order to bring stories from industry and politics to the citizens. Search engines engage in algorithmic retrieval in order to bring stories from the newspapers (and other sources) to the citizens. The historical role of the press has been to provide transparency. So also is the modern role of the retrieval engine to provide transparency. And just as a good reporter has to cite sources to make their stories credible, so should a search algorithm provide explanatory interfaces, algorithms, and information to make its results credible.

Shouldn’t there be an expectation of as much information and transparency from our search interfaces and algorithms as we have from our press? It is no secret that I think there should be. It is a goal that I strive for in my own research; I can’t say that it’s not difficult, but it is worth striving for.

Exploration, Collaboration, and Open Government

jeremy — Tue, 22 Sep 2009 14:54:28 +0000

What sort of information retrieval system would you build if you knew that all the users of your system would be expert or highly-motivated amateur searchers? What sort of system would you build when you have a very large collection of unstructured information, and the goal in searching that information is not to find one document (e.g. navigate to a home page), but to find (a) relationships between documents, or (b) large sets of documents that all pertain to a single topic? How would your algorithms be different? How would your interfaces be different? How would the process itself (that middle layer in between algorithms and interfaces) be different?

Via Daniel Tunkelang’s recent post, I think that Government information might be a perfect domain in which to ask (and answer) these sorts of questions. The U.S. Open Government Initiative has as its goal the release of loads of raw government data for use by any individual or organization. How are people going to use this data? What types of questions will they ask? What types of questions could they ask, if given the proper tools (i.e. what might they not know that they want to ask, until it becomes possible?)

Two types of information retrieval might be perfect for this domain: Exploratory Search and (Explicitly) Collaborative Search. In exploratory search, the goal of your information seeking is to learn, discover, compare, contrast, etc. In explicitly collaborative search, your goal is to do something similar, but with another set of like-minded partners working with you on the same task/topic. Each partner may have different expertise; one may be an expert in energy policy, another might understand trade and commerce, and another might have experience with the inner workings of Congress and understand how it works on a practical level. If you put all these people together right now, the only way they can work together on a shared task is to search separately and then email each other their results. What if, however, you could design a system that not only mediated between them on an interface level (immediate notification of marked documents and passages, shared highlighting of seen documents, etc.) but mediated between them on an algorithmic level as well? Algorithmic mediation of the collaborative process would mean that the retrieval system itself has a hand in both combining and partitioning the inputs and actions of the search team members, as necessary. They might then be able to find important, valuable information that none of the searchers, had they been working alone, could have.
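As a toy illustration of what "combining and partitioning" could mean, consider the sketch below. It is my own simplification, not a description of any existing system: it combines what every team member has already seen, removes those documents from a shared ranked list, and then partitions the remainder round-robin so that no two searchers are handed the same document. The function name `mediate`, the searcher labels, and the document ids are all hypothetical.

```python
def mediate(ranked_docs, seen_by, searchers):
    """Split the team's unseen documents across searchers, round-robin."""
    # Combining: pool everything any team member has already looked at.
    already_seen = set().union(*seen_by.values())
    unseen = [doc for doc in ranked_docs if doc not in already_seen]
    # Partitioning: searcher i receives every len(searchers)-th unseen document.
    return {name: unseen[i::len(searchers)] for i, name in enumerate(searchers)}

# Hypothetical team and a shared ranked list for one topic.
searchers = ["energy_expert", "trade_expert", "congress_expert"]
ranked_docs = ["d1", "d2", "d3", "d4", "d5", "d6"]
seen_by = {"energy_expert": {"d1"}, "trade_expert": {"d4"}}

assignments = mediate(ranked_docs, seen_by, searchers)
print(assignments)
```

A real mediator would presumably route documents by each partner's expertise rather than by simple rotation, but even this much is more than emailing results back and forth can do.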

It seems like an interesting domain, and one with real, potentially quite important consequences and societal implications. It will be interesting to watch as this develops.

Breadth Destroys Depth

jeremy — Wed, 26 Aug 2009 13:21:14 +0000

A few days ago I posted a question about why modern web retrieval systems offer no explicit relevance feedback mechanisms. I wonder if it has anything to do with the following attitude, explained by one of my favorite bloggers, Nick Carr:

The problem with the Web, as I see it, is that it imposes, with its imperialistic iron fist, the “ecstatic surfing” behavior on everything and to the exclusion of other modes of experience (not just for how we listen to music, but for how we interact with all media once they’ve been digitized). In the pre-Web world, we not only enjoyed the thrill of the overnight sensation – the 45 that became the center of your waking hours for a week only to be replaced by the new song – but also the deeper thrill of the favorite band in whose work we deeply immersed ourselves, often following its progression over many records and many years. It wasn’t that long ago that buying an album represented, particularly for your average teenager, a significant investment. You thought a lot about that album before you bought it, and once you bought it you took it seriously – you listened to it. Repeatedly. Today, we’re quick to dismiss those ancient days of “scarcity” and to celebrate our current “abundance,” but scarcity had something going for it: it encouraged a deep engagement in listening to a particular piece of music, across the expanse of an album, and it also encouraged, in the artist, an interest in rewarding that engagement. I would like to get back the money I spent on records in my youth, but I would not give up the experience that money bought me.

Perhaps relevance feedback hasn’t been implemented on the web, not because it isn’t useful (it is), not because it doesn’t work (it does), not because it’s too complicated (it isn’t), or not even because it’s too inefficient (depends on the implementation, I suppose). No, perhaps relevance feedback hasn’t been implemented on the web because most people are busy “ecstatically surfing” the web, favoring quick, easy, surface answers over deeper engagement and knowledge. A breadth of popular answers may be more valuable to society than being able to go deeper on any one answer. Carr continues:

It’s the deep, attentive engagement that the Web is draining away, as we fill our iTunes library with tens of thousands of “tracks” at little or no cost. What the Web tells us, over and over again, is that breadth destroys depth. Just hit Shuffle.

I do wonder if there are similarities, and if so, what the social implications are of a society built on breadth over depth. Not that we aren’t mostly there, already. But if we’re building information retrieval systems that purposely accelerate skimming and breadth at the expense of depth, if we increase that feedback loop, what does that portend? It is a question I occasionally ponder as I think about the social impact of my research.
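On the claim above that relevance feedback is not too complicated: the classic Rocchio update fits in a few lines. This is a minimal sketch over plain term-weight dictionaries, with the commonly cited default weights (alpha=1.0, beta=0.75, gamma=0.15) and negative weights dropped, which is a common practical choice. The query and documents are invented for illustration, and no claim is made about how any web search engine would implement it.

```python
from collections import defaultdict

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward relevant documents and away from non-relevant ones."""
    updated = defaultdict(float)
    for term, weight in query.items():
        updated[term] += alpha * weight
    for doc in relevant:
        for term, weight in doc.items():
            updated[term] += beta * weight / len(relevant)
    for doc in non_relevant:
        for term, weight in doc.items():
            updated[term] -= gamma * weight / len(non_relevant)
    # Keep only terms whose weight is still positive.
    return {term: weight for term, weight in updated.items() if weight > 0}

# Invented example: the user marks one document relevant, one non-relevant.
query = {"jaguar": 1.0}
relevant = [{"jaguar": 1.0, "cat": 1.0}]
non_relevant = [{"jaguar": 1.0, "car": 1.0}]

new_query = rocchio(query, relevant, non_relevant)
print(new_query)
```

The updated query is then re-run for the same user in the same session, which is the closed loop that makes it feedback.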