A/B testing needs a theory

Randomized controlled experiments (A/B tests) are frequently said to be the gold standard for decision-making. However, to make a good decision, you have to really understand what question is being answered.

Let’s say, hypothetically, that you want to try a new color palette for your website. You have the mechanisms to run proper A/B tests: you can bucket users into A and B groups such that the only difference between them is the color palette. You choose, arbitrarily, a random palette, run the experiment for a week, and find that the users in the B group have a $200k lift in revenue, which annualized represents a $10M increase in revenue. That’s not nothing. You roll it out to all users, write your performance review, book a party and wait for your promotion.

There’s just one important thing missing: you have no idea exactly why it worked. Here’s a few possible explanations, some of which are quite far-fetched:

Maybe the new palette just randomly made your site look more like its main competitor, and users that would normally leave your site keep using it, blissfully unaware. The next time your competitor changes their color schema, these users will not be confused anymore and will leave. And maybe they’ll do this next week, because they just realized they lost some users.
Maybe the color palette of the world outside makes your new palette very attractive, but only for the season when the test ran, but it makes it less attractive in other seasons of the year. With the change you’re actually losing money for the year. That’s a little bit of a joke explanation, but it has a strong scientific basis (the “crocs and socks illusion”, also the “dress illusion”).
Maybe you found a cheat code, the exact combination of colors that makes other people understand you better, and could use this to improve educational outcomes and save the whole world $1T per year, rather than $200k for that week. By deploying it to your site and moving on, you have a lost opportunity cost of approximately $1T.
Maybe you really found something that only makes it easier for users to use your site and that it’s going to work well for one year or more, so it’s really valued at $10M per year… as long as you don’t make any other changes to the site.

Your A/B experiment is on the top of the evidence pyramid, but for which of the cases above? Without a theory and alternative explanations, you may never know.

There is, however, a hint that you learned something, and that it’s familiar to data scientists: the descriptive, predictive and prescriptive/causal framework. When you really understand how something works, if you have the resources, you can make it happen at will. For example, if the reason why you got a lift in your experiment is because you were looking like your much larger competitor, you could now mimic their color palette every time they make a change and keep getting the lift. You could mimic palettes of other larger competitors and get lifts there, too.

A/B tests are great evidence when you understand the question. Better make sure you do.