I have a question that has been bothering me, kicking around in my head, for at least half a decade now. And I can’t seem to come to any solid conclusion on it. I suppose it can’t hurt to throw it out here onto the web, and see if one of my 3 readers has any thoughts on the matter.
A large amount of web search effort goes into statistics and probabilistic modeling of user queries and behaviors. There seems to be a generally-accepted, widely-held belief that the way to go about web search is to look at vast quantities of data, and modify search engine parameters based on that observable data. With enough users, and enough data, the search engine can be made better.
The approach seems very reasonable. Where I get concerned is with the iterative feedback loop:
1. The designers of the search engine provide specific functionality
2. The user utilizes that functionality
3. The search engine collects data/logs about the use of that functionality
4. The designers use machine learning to analyze those logs for evidence of success and failure
5. The designers modify the search engine to provide improved functionality that better satisfies these user behaviors
6. Goto 2.
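The loop above can be sketched as a toy program. Everything in it is a hypothetical stand-in of my own invention — the "engine" is a single ranking weight, the "logs" are simulated click data, and "learning" is a simple average — but it makes the shape of the cycle concrete:

```python
# Toy sketch of the feedback loop (all names and numbers are
# illustrative placeholders, not any real search engine's API).

def collect_logs(weight, sessions=100):
    # Steps 2-3: users interact with the engine; the engine logs
    # the observed success signal (a stand-in for click-through).
    return [min(1.0, weight)] * sessions

def learn_and_refine(weight, logs):
    # Steps 4-5: mine the logs, nudge the parameter toward
    # whatever the existing functionality already rewards.
    observed = sum(logs) / len(logs)
    return weight + 0.1 * (1.0 - observed)

weight = 0.5
for _ in range(10):                # step 6: "Goto 2", repeatedly
    logs = collect_logs(weight)
    weight = learn_and_refine(weight, logs)

# Note: every iteration only tunes the parameter the engine already
# exposes. A user need that the interface cannot express never shows
# up in `logs` at all, so the loop can never react to it.
```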
What I do not see in this whole process is a way for the user to tell the engine designers that they need anything other than what the search engine already provides. Put another way: analysis of failure in steps 3 and 4 only really tells you how well the user did when using the existing functionality. It does not tell you that the user had an information need whose satisfaction falls outside that existing functionality.
So how do (or how should) the developers of a search engine learn to recognize user needs and behaviors that fall outside of what the users are able to say (through log analysis)? I am assuming that user interviews are too small-scale to gather the necessary information at web scale. And A/B testing only works when you’ve got a B that you’ve designed in reaction to a known behavior. But how do you develop that B if you’ve never observed the behavior that will make use of B in the first place — not because that behavior doesn’t exist, but because you cannot observe it until B exists? In short, when adding interactive features and capabilities to a search engine, how does one develop those features proactively, rather than only reactively?
“How do the developers of a search engine learn to recognize user needs and behaviors that fall outside of what the users are able to say (through log analysis)?”
Actually, user interviews should be a useful way to gather ideas that can then be validated at a larger scale. And Google does offer a feedback form. I wonder how often people use it. Those users are unlikely to suggest innovative solutions, but at least they might pose the problems that inspire Google to develop them.
Ok, FD: Next question: So why isn’t anything done about it? 😉
Sorry, Fernando, that was a flip response. What I mean is, the user can’t always articulate what it is that the search engine is *not* doing, to help them find the information that they need. Take my ongoing “prague cafes” example (http://irgupf.com/2009/04/23/retrievability-and-prague-cafes/). If I were an average user, I might write something like “I don’t know if I’ve found everything that I need”.
But it’s quite a leap to go from that user statement, to the notion that some sort of exploratory interface and supporting algorithm is needed, never mind even what interface and algorithm would be best for satisfying this type of “off the beaten path” cafe search.
But yeah, as Daniel says, it could at least be a starting point to get engineers to think about new problems.
Looking at your feedback loop in the post, it seems like the typical target of refinement would be failures within the system itself. But, it sounds like you want to be looking at failures of the *process* that the system supports (or forces users into).
There’s an art to identifying those types of failures — as you’ve said, users may not know how to articulate what the failure may be, or may not even perceive it as a failure since they’re using the provided tool as it was intended to be used and as it is used by everyone else. Most people probably don’t even think there might be a better way, they just work around the provided constraints.
This is the case with all design, not just information seeking systems. Check out this video of the president of Oxo talking about their design process, especially the bit about their measuring cup. No one perceived the tool as lacking until the whole process was evaluated. Even then, it took a real stroke of creativity to break out of the established process.
oops — the video is here:
Yes, exactly, Jon. Failure in the process, rather than failures within the existing system. Everybody raves about A/B testing and log analysis, but it seems to me that even the people doing the log analysis are only looking for failures within the existing system. And if a searcher is trying to do something outside of the constraints of the system, no amount of machine learning, user modeling, or log analysis will detect that.
Everyone talks about how “user driven” they are, and so there must be some sort of process for understanding and measuring situations wherein the process itself is not working. That’s what I’ve been seeking to understand for many years now.
So is your Oxo example the answer? There really is no process for detecting “out of band” processes, and it is up to the search engine designers to come up with that stroke of creativity?
I would recommend watching Jon’s video, above, starting at about 16.30, the example with the measuring cup. No user could articulate that a problem with their current measuring process existed, until they were provided with a better solution.
There’s certainly a huge utility to A/B testing and log analysis. A/B testing can very effectively support incremental system changes (A and B need to be comparable), and log analysis is essential for understanding how people use the existing system & possibly influencing what the incremental changes might be.
But I agree with you that they’re not sufficient for understanding process failures. If you want to understand these with automatic machine learning methods, I’d guess that a much broader view of a “log” is needed: for example, capturing all the actions associated with an information-seeking task, not just in the search engine but in the word processor, on your blog, on Twitter, in your code editor, on your phone, and even in personal interactions away from the computer. It’s hard to imagine collecting this data at all, much less at a reasonably large scale.
Yeah, let me clarify: for the majority of search engine scenarios, averaged over all users, standard log analysis and A/B testing is just fine. There is no reason why it shouldn’t be the preferred method, in fact, given the scale of the web. I never meant to imply it isn’t the way to go, a decent percentage of the time.
It’s when you start to get to the long tail of information needs that things change: the variety of users who want to use the information on the web for something other than looking up a home page. In those learning-oriented, knowledge-skimming, synthesis-oriented, and analysis-oriented types of search tasks, simply mining the logs may not give you the information you seek.
For example, one person (let’s say “User A”) might run 6 queries in a row, constantly clicking on the back button and trying another one, because they can’t find the “correct” query to get them to the known-item piece of information that they are seeking. That would be a failed search, using existing tools. However, another person (let’s say “User B”) might run 6 queries in a row, quickly skimming the summaries for the top few results to each query and then constantly clicking on the back button and trying another query, because user B is attempting to probe the conceptual boundaries of a topic area. That would be a successful search, but one in which the tool was used quite awkwardly, i.e., the search engine really didn’t provide the right tool, so the user had to simulate a boundary-probing search tool manually. Like using the prongs of a hammer to grab hold of and screw in a screw: you can do it, but it ain’t pretty.
However, from the perspective of the log analysis, both users A and B look exactly the same. Both try 6 different queries, in rapid succession, and don’t click anything after running each query. But user A considered his/her search a failure, and user B considered his/her search a success.
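A toy sketch makes the point concrete. The session schema and feature set below are my own invented illustration, not any real log format, but they show how two sessions with opposite outcomes reduce to identical features:

```python
# Hypothetical illustration: two sessions that are behaviorally
# identical in the logs but opposite in outcome for the user.

def log_features(session):
    """Reduce a session to the features a log miner typically sees:
    query count and clicks per query (query text carries no outcome)."""
    return {
        "num_queries": len(session["queries"]),
        "clicks_per_query": [len(c) for c in session["clicks"]],
    }

# User A: six failed attempts to reach one known item. No clicks,
# because no result ever looked right. A failure.
user_a = {"queries": [f"attempt {i}" for i in range(6)],
          "clicks": [[] for _ in range(6)]}

# User B: six probes of a topic's conceptual boundaries, reading only
# the result summaries. No clicks, because the summaries were enough.
# A success, awkwardly achieved.
user_b = {"queries": [f"probe {i}" for i in range(6)],
          "clicks": [[] for _ in range(6)]}

# From the log miner's point of view, the sessions are indistinguishable.
assert log_features(user_a) == log_features(user_b)
```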
So I can’t say for sure, because I am not at one of these companies. But it seems to me that it would be very difficult to tell these two types of users apart, to know for sure from the log analysis whether both users were of type A, or both of type B, or one of each. And even if you did user interviews, and discovered that user B exists, you still have to rely on log mining to know exactly how many people of type B there are. And in that case you’re back to the old problem of telling user A and user B apart in the logs, when their behaviors look exactly the same.
They also failed to keep up to date with new trends, such as the stream, which created the need for real-time search. I think it is impossible for search engines to somehow infer things like this that fall outside their existing scheme of functioning.