I have more questions than I have answers. One of the topics that I know very little about, and on which I often seek clarification and wisdom, is A/B testing in the context of rapid iteration, rapid deployment online systems. So I’d like to ask a question of my readership (all four of you 😉 )
Suppose Feature B tests significantly better than A. You therefore roll out B. Furthermore, suppose later on that Feature C tests significantly better than B. You again roll out C. Now, suppose you then do an A/C test, and find that your original Feature A tests significantly better than C.
What do you do? Do you pick A again, even though that’s where you started? Roll back to B, because that beats A? Stick with C, because that beats B?
I’ve worked on enough low-level retrieval algorithms to have seen things like this. Changes and improvements do not always monotonically increase. When you run into a loop like this, what do you do? Sure, it would be nice to come up with a Feature D that beats all of them. And in an offline system, with no real users depending on you minute-by-minute, you can take the research time to find D. But in the heat-of-the-moment online system, one in which rapid iteration is a highly valued process, which of A, B, or C do you roll out to your end users?
Could you clarify this for me? It sounds like you have some kind of objective (empirical) performance measure as the basis for “tests (significantly) better”. If you do, then P(A) < P(B) < P(C) < P(A) should be impossible.
If on the other hand each pairwise test uses a different performance measure (due to context / outside influences, etc.) and you get P1(A)<P1(B) and P2(B)<P2(C) and P3(C)<P3(A), my guess is that you should seriously start to question the validity of your performance measures.
Jens, good point. Let me try and clarify:
Traffic, traffic, traffic. I hear from online folks that it is crucial to make sure that users keep using your service. So let’s say that our objective is a function of the number of users and the number of queries those users run per day.
For example, suppose we have 500 million users. For simplicity’s sake, let’s group them into 5 groups, and pretend that all 100 million users in each group behave exactly the same. (I’m just doing this to make the example tractable.) Now, under System A, suppose the users in each of the 5 groups execute the following number of queries:
{ 5, 6, 2, 5, 2 }
Under System B, these users execute a different number of queries:
{ 6, 7, 3, 3, 1 }
So System B offers improvement over System A for 300 million users (3 of 5 groups, i.e. groups #1, #2, and #3) and makes things worse for 200 million users. More people are improved than worsened, and it’s unlikely to be due to chance.
Now, System C comes along and yields this query frequency pattern:
{ 10, 2, 1, 4, 3 }
Again, System C offers improvement over System B for 300 of 500 million users (groups #1, #4, #5). Again unlikely to be due to chance.
But now compare System A to System C. System A is better than System C for 300 million users; see groups #2, #3, #4.
This is information retrieval. You’re never going to have a system that offers improvement on every single query for every single person. That’s just the nature of the problem. So what you try to do is maximize improvement for the greatest number of people. And in each case, going from System A to B to C, you keep making improvements that affect the greatest number of people, relative to your baseline system in a pairwise A/B test comparison. But then it loops around back from C to A, and you still get that pairwise improvement for A over C. Only you’re now back where you started.
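To make the cycle concrete, here is a minimal Python sketch of that “better for more users” comparison, with the group query counts from my example hard-coded (the groups are equal-sized, so counting groups is the same as counting users):

```python
# A minimal sketch of the pairwise "better for more users" comparison from the
# example above. Group sizes are equal (100 million users each), so counting
# groups is the same as counting users.

queries = {
    "A": [5, 6, 2, 5, 2],
    "B": [6, 7, 3, 3, 1],
    "C": [10, 2, 1, 4, 3],
}

def beats(x, y):
    """True if system x improves more groups than it worsens, relative to system y."""
    improved = sum(qx > qy for qx, qy in zip(queries[x], queries[y]))
    worsened = sum(qx < qy for qx, qy in zip(queries[x], queries[y]))
    return improved > worsened

for challenger, incumbent in [("B", "A"), ("C", "B"), ("A", "C")]:
    print(f"{challenger} beats {incumbent}: {beats(challenger, incumbent)}")
# Prints True for all three comparisons: B over A, C over B, and A over C. A cycle.
```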
So where is my reasoning breaking down?
Ok, now I see what you mean. I guess that the comparison criterion “improvement for most cases” is problematic.
In information retrieval you have e.g. “average precision” as a per case (query) metric, and when comparing systems you use an aggregate that best reflects what’s important to you. The typical ones would be “mean average precision” (MAP) and “geometric mean average precision” (gMAP), where the latter gives more weight to low-performing cases.
So I would probably discard the pairwise comparison criterion and replace it by an appropriate aggregate measure. That way you are sure to avoid the transitivity problem in your comparison.
Whether it’s easy to come up with a good aggregate is a different question, of course.
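To illustrate with a rough sketch, reusing your per-group query counts as stand-in per-case scores (real MAP/gMAP would of course be computed over per-query average precision, not query counts): an arithmetic or geometric aggregate gives each system a single number, and hence a linear order.

```python
# A rough sketch of the aggregate idea, using the per-group query counts from
# the example above as stand-in per-case scores.

import math

scores = {
    "A": [5, 6, 2, 5, 2],
    "B": [6, 7, 3, 3, 1],
    "C": [10, 2, 1, 4, 3],
}

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    # Gives more weight to low-performing cases; assumes strictly positive scores.
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

for name, xs in scores.items():
    print(name, round(arithmetic_mean(xs), 2), round(geometric_mean(xs), 2))

# Arithmetic means: 4.0 for all three systems (a tie).
# Geometric means: A = 3.59, B = 3.28, C = 2.99, i.e. a single linear order, no cycle.
```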
Or you could do what I tried for my doctoral thesis: apply different strategies on a per-case basis.
I suspected that your response might involve some sort of averaging procedure. 🙂 If you look at my example above, you’ll see that the total number of queries under each system (A, B, and C) each sum to 20, and thus average out to the same value: 4 queries per user, per day. From the average perspective, there is zero difference between each of the systems.
But averaging, and even geometric averaging, obfuscates what you’re really after, imho… which is that you want to improve everyone’s experience. Or as many people as possible. So while algebraically problematic, the goal itself is good.
Doing what you say, i.e. discarding the pairwise comparison, would indeed work. But let’s face it: some people are heavy queriers (dozens of queries per day) and some are light queriers (1-2 queries per day). What you really have is a bimodal (or trimodal? k-modal?) distribution of query frequency habits, and averaging the heavies in with the lights, even geometrically, destroys that information and doesn’t give you as clear of a picture of what is going on.
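Here is a toy illustration of what I mean, with numbers invented just for this comment: the overall average looks essentially flat, while the light queriers improve and the heavy queriers regress.

```python
# A toy illustration of the bimodal-distribution point: the overall mean hides
# the fact that a change helps light queriers but hurts heavy queriers.
# All numbers are invented for illustration.

light_before, light_after = [1, 2, 1, 2], [2, 3, 2, 3]          # light users improve
heavy_before, heavy_after = [30, 40, 35, 39], [28, 37, 33, 36]  # heavy users regress

def mean(xs):
    return sum(xs) / len(xs)

print(mean(light_before + heavy_before), mean(light_after + heavy_after))
# 18.75 vs. 18.0: the pooled average looks roughly flat.
print(mean(light_before), mean(light_after))   # 1.5  -> 2.5 : lights clearly better
print(mean(heavy_before), mean(heavy_after))   # 36.0 -> 33.5: heavies clearly worse
```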
I think you’re correct about applying different strategies on a per-case basis. I.e. perhaps do something like detect whether someone is a “heavy” or a “light” user, and then give System A to the heavies and System B to the lights (and System C to the mediums?) That brings up all sorts of different questions though: If the change between System A/B/C is an interface change rather than a hidden, backend algorithmic change, will it throw off users to see different interfaces when they’re logged in, versus when their Mom is logged in? And how do upgrade a light to a medium, a medium to a heavy? Do you give the users the ability to opt back out, if you automatically re-classify them? More answers bring more questions 🙂
Could you paste a link to your doctoral thesis? I’d be interested in reading it.
Just to clarify: When you say discard the pairwise comparison, you’re talking about the group #1 through group #5 pairwise comparison? Not the System A vs. B, B vs. C, C vs. A pairwise comparison? Right? Because the latter is the whole point of A/B testing: You do pairwise comparisons.
I mean that once you move from “better for most users” (which only allows for pairwise comparisons between systems/features) to an aggregate measure you should get a linear order between all systems (plus some noise from the fact that the A/B test is done in different conditions than B/C).
This leads back to the other point: if you do not get a linear order on the aggregate performance measure, your measure is not sufficiently robust and is overwhelmed by the differences between the separate tests.
To me, trying to develop an aggregate measure makes sense. It makes you think about what changes are important: “unusable to unusable” or “excellent to slightly differently excellent” does not have the same impact as “bad to good”.
I think that your pairwise comparison of “better for more users” is not robust and can lead to circular results because it is not a good measure of performance.
P.S.: My thesis is unfortunately only available in French and I never made a journal paper out of it, but you can get a short paper here: http://www.haifa.ibm.com/sigir05-qp/ about predicting the difficulty of queries (and the quality of retrieval results).
Given that query expansion through BRF (blind relevance feedback) doesn’t work if your initial results are too bad (it expands with the wrong terms) or too good (no possible improvement), I tried to train classifiers to decide when to apply BRF depending on the query, but I never obtained very satisfying results. I still think that system selection depending on initial analysis can be very interesting, though.
Addendum: you may of course want to look at different (aggregate or not) performance measures reflecting different aspects and then decide whether you are willing to sacrifice top (power user) performance to have better average performance or not, etc.
Jens, I’ve been having a conversation off-blog with another reader/commenter. And he keeps pointing out to me that even if you have aggregate performance measures, you still might encounter the situation that I’m describing: When the stochastic process that the measure is trying to quantify is non-stationary (the signal changes over time) you still might get B better than A in March, and A better than B in July — because you’ve rolled out B in the intervening time period.
The problem is, I don’t know any process that is completely stationary. When people look for information, they find that information and that information changes their internal state (tweaks the parameters of their stochastic processes). They might not act the same way in the future that they did in the past. When people see an interface change, that also changes their internal state. So the stochastic processes (the actions generated by the users of the system) are in constant, non-stationary flux in the presence of a constantly changing system and information.
So that’s really the core of my original question above. The whole purpose of rapid iteration, rapid deployment, A/B testing is to do a test, pick a winner, and then implement/roll out that winner. Once you roll out the changed system, though, you’ve lost stationarity, and therefore your previous A/B test is no longer valid. You might do another A/B test in two months and get completely different results. Not even a B/C or C/A test: the same A/B test might, in two months, give you a completely different result. Given that fact, what do you do? You rapidly deploy B when it beats A. But then do you turn right back around two months later and rapidly deploy A when it beats B again? (And then flip back to B a few months after that, etc?) Or do you wait a while to deploy changes? I can see that waiting to deploy gives you more confidence in the stability of B over A, in your test. But once you’ve rolled out B in full, and everyone has grown accustomed to it and has had their internal state changed because of it, the validity of that first A/B test still goes out the window (because of that state change).
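To show what I’m worried about, here is a toy simulation. It assumes, purely hypothetically, a simple novelty effect in which users issue a few extra queries on whichever interface they are not yet accustomed to; the two systems are otherwise identical, and every number is invented for illustration. The same A/B test flips direction once B becomes the deployed default:

```python
# A toy simulation of the non-stationarity worry. Assumed mechanism: a "novelty"
# effect, where users issue a few extra queries on the arm they are not yet
# accustomed to. All numbers are invented for illustration.

import random

random.seed(0)

def queries_per_day(system, current_default):
    """Base rate of 4 queries/day, plus a novelty bonus for the non-default arm."""
    novelty_bonus = 0.5 if system != current_default else 0.0
    return 4.0 + novelty_bonus + random.gauss(0, 0.3)

def ab_test(current_default, n=2000):
    """Mean queries/day in each arm, given what users are currently accustomed to."""
    mean_a = sum(queries_per_day("A", current_default) for _ in range(n)) / n
    mean_b = sum(queries_per_day("B", current_default) for _ in range(n)) / n
    return round(mean_a, 2), round(mean_b, 2)

print("March, with A deployed:", ab_test("A"))  # roughly (4.0, 4.5): B wins, so roll out B
print("July, with B deployed:", ab_test("B"))   # roughly (4.5, 4.0): now A wins the same test
```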
That’s my question. That’s what I am seeking more understanding on. I get A/B tests. I get rapid deployment. It’s the interaction between A/B tests and rapid deployment that (I think) causes problems. How are those problems handled?
Let’s imagine you did an A/B test on Christmas Eve, and you got statistically significant results regarding the landing page for last-minute Christmas gift delivery.
Happy as you are about these results, you wouldn’t generalise them to the whole of 2010. User behaviour changes seasonally, and there are lots of other temporal externalities: has other stuff changed on your site? Has a dominant competitor appeared that does things in a certain way, thus setting user expectations?
Correct me if I’m wrong, but this seems like the core issue when testing A/B at one point in time, and then B/C at another point in time. Lots of extraneous variables.
Harry: Yes, I think you’re correct. It’s because things do change all the time, not least the users themselves. Your Christmas Eve example is extreme, and the point is taken. However, things change even within seasonal boundaries. The moment you roll out B, after it successfully wins against A, you’ve effectively changed the state of the world. The moment you change that state, and users start adapting and changing their behavior as a result of that new state, all old bets are off. And so C might later beat B. And then when you roll out C, after some period of time A might then beat C.
So I’m back to my original question: What do you, as a system designer, do in those situations? When you observe this sort of loopy behavior, do you go back to your original A, even though you’d given up on it before? Or do you avoid going in loops, and instead try to find a better D that beats all of A, B, and C? And what if you can’t immediately find D? Do you stick with C until you can?
That’s my question.