A couple of weeks ago I was talking to a marketer who was excited about the results of her most recent round of A/B testing. She’d hypothesized that a small tweak to the copy on a landing page of her client’s site would increase conversions, so she implemented a split test and waited for the results to roll in.
A/B testing is a useful way to gather evidence about the effectiveness of web design decisions, and I was interested to see her results. Or at least, I was until she said that her idea was so effective that she ended the test after three days and implemented the changes she’d suggested across the site.
It’s not the first time I’ve come across a marketer who thinks like this, and it’s understandable. We all want results and we want them quickly. Why wait around if a change to a website produces a positive outcome? If we delay implementing positive changes for all leads, we’re missing opportunities to make conversions.
If you know anything about statistics, you’ll see the problem here. If the sample size of a test is too small, there’s a chance the result is a false positive. In fact, if you end the test too soon, it has almost no statistical power — you might as well have tossed a coin to decide whether the control or the variation is the winner.
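You can see just how little protection an early stop gives with a quick simulation. The sketch below (plain Python; the traffic numbers are made up for illustration) runs A/A tests, where both variations are identical, but "peeks" at a standard two-proportion z-test every day and stops at the first significant result. Even though there is no real difference between the variations, far more than the nominal 5% of these tests declare a winner:

```python
import random

random.seed(2024)

def peeking_false_positive_rate(trials: int = 500, days: int = 14,
                                visitors_per_day: int = 200,
                                base_rate: float = 0.05) -> float:
    """Fraction of A/A tests (identical variations, no real effect)
    that 'find' a winner when checked daily and stopped at the first
    significant pooled two-proportion z-test (|z| > 1.96)."""
    false_positives = 0
    for _ in range(trials):
        conv = [0, 0]   # conversions for variations A and B
        n = [0, 0]      # visitors for variations A and B
        for _ in range(days):
            for i in (0, 1):
                for _ in range(visitors_per_day):
                    conv[i] += random.random() < base_rate
                n[i] += visitors_per_day
            # Daily peek: pooled two-proportion z-test.
            pooled = (conv[0] + conv[1]) / (n[0] + n[1])
            se = (pooled * (1 - pooled) * (1 / n[0] + 1 / n[1])) ** 0.5
            if se > 0 and abs(conv[0] / n[0] - conv[1] / n[1]) / se > 1.96:
                false_positives += 1
                break
    return false_positives / trials

print(f"{peeking_false_positive_rate():.0%} of A/A tests falsely declare a winner")
```

Checking once at a pre-registered sample size keeps the false positive rate at 5%; checking every day and stopping at the first "significant" result multiplies it several times over, which is exactly the trap of ending a test after three good days.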
Let’s use coin tossing as an example. The chance of getting heads on a fair coin toss is 50/50. If you tossed the coin 1,000 times, you’d be quite surprised to see it land on heads 900 times and tails only 100. You’d suspect that the coin wasn’t properly balanced. On the other hand, if you tossed the coin five times and it landed on heads four times, you wouldn’t be all that shocked, and you wouldn’t conclude that the coin was somehow biased. We all know intuitively that small samples tend toward extremes.
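That intuition is easy to check with a short simulation of a fair coin (plain Python; the 80/20 cutoff for "lopsided" is an arbitrary choice for illustration):

```python
import random

random.seed(7)

def heads_fraction(tosses: int) -> float:
    """Fraction of heads in one run of fair coin tosses."""
    return sum(random.random() < 0.5 for _ in range(tosses)) / tosses

def lopsided_share(tosses: int, runs: int = 2000) -> float:
    """Share of runs where heads came up at least 80% or at most 20% of the time."""
    fractions = [heads_fraction(tosses) for _ in range(runs)]
    return sum(f >= 0.8 or f <= 0.2 for f in fractions) / runs

print(f"5-toss runs that look lopsided:    {lopsided_share(5):.1%}")
print(f"1000-toss runs that look lopsided: {lopsided_share(1000):.1%}")
```

With a fair coin, roughly a third of five-toss runs come up at least four heads or four tails, while a thousand-toss run essentially never strays that far from 50/50. A three-day A/B test is the five-toss run.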
It’s easy to fall for the law of small numbers, the mistaken belief that small samples behave like large ones, when we’re under pressure to produce results. My friend’s stroke of luck in hitting on just the right copy to lift conversions was likely just that: luck, a random variation in conversion rates over a short timespan that isn’t predictive of future performance.
What would have been a reliable test? Optimizely has a great tool for finding out. Its sample size calculator will take the control group’s current conversion rate and the conversion rate change you’d like to be able to reliably detect. It then tells you how many sample events you need to be reasonably sure that the results are significant.
If you start with a baseline conversion rate of 5% and you want to reliably detect a relative change of 10% at 95% statistical significance, you’d need around 31,000 samples per variation. The smaller the conversion rate change you want to detect, the more samples you need. If you don’t think you need to be 95% certain, you can choose a lower level of certainty, but that increases the chances of misleading results.
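Numbers in that ballpark fall out of a standard power calculation. Here’s a sketch using the classic normal-approximation formula for a two-sided, two-proportion test at 80% power; Optimizely’s calculator uses its own statistics engine, so its figures won’t match exactly, but the order of magnitude is the same:

```python
import math
from statistics import NormalDist

def samples_per_variation(baseline: float, relative_lift: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Classic normal-approximation sample size for a two-sided
    two-proportion test: how many visitors each variation needs
    to detect a relative lift over the baseline conversion rate."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    p_bar = (p1 + p2) / 2
    n = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
         + z_power * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / (p1 - p2) ** 2
    return math.ceil(n)

# 5% baseline, 10% relative lift, 95% significance -> roughly 31,000 per variation
print(samples_per_variation(0.05, 0.10))
```

Notice how the requirement explodes for smaller effects: the detectable difference appears squared in the denominator, so halving the lift you want to detect roughly quadruples the sample you need.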
It often comes as a surprise to marketers that the sample sizes need to be so large, but you can’t argue with the math. If you go to your client with what looks like a great result, but you’ve only collected 500 samples, it’s very likely that any changes they make to their site on the basis of your advice will be neutral at best and harmful at worst.
If you want to learn more about properly constructing A/B tests, Martin Goodson wrote an excellent article that covers sample sizes, the risks of multiple simultaneous testing, and regression to the mean.