Running split tests is addictive. You develop a theory about something that’s not working as hard as it could be on your website, and that becomes your new test idea. You develop the creative – a copy improvement, a usability fix, or perhaps a product repositioning – and then you launch your experiment. The excitement is palpable!
Here’s the addictive part: Whether you’ll openly admit it to others or not, I’m guessing you’ll check the testing tool multiple times per day – and more than likely, multiple times per hour – to see how your revised page is performing. It’s hard not to check.
And you want so badly for your testing tool to declare a winner. When it’s your test idea, it’s easy to let your emotions take over. If your new recipe pulls ahead of the default page, you feel confident, amazing, invincible. But if your recipe falls behind the default content, it’s like someone came along and shot your dog.
Trust In The Tool
Adobe Test & Target, Google Content Experiments, Optimizely, and Visual Website Optimizer are all great options for running your testing program. There is a tool for every budget. And no matter which one you go with, once you have a test underway, it will at some point – statistics allowing – declare a winner.
As subscribers to these tools, we (start-up founders, marketers, and product developers) trust what they tell us. I’d argue that the more you pay, the more you trust what the tool tells you.
Unless you’re a statistics expert, the tool is the authority. You rely on it for advice on when to stop a test. So when the tool congratulates you on achieving a winning variation, what do you do? I’d bet my next paycheck on the fact that you’ll take the money and run.
After all, why would you let the test continue to run if it’s telling you there is a 99.6% chance of beating the original page? That number seems as close to 100% as one would ever hope to achieve with a test. In my experience, most marketers will bank a win at 95% confidence… even 90%. But be careful – that lift as reported by your tool of choice may not be what it appears.
Wait, Where’d That Lift Disappear To?
If you’re a smart marketer, you probably spend time reading about other people’s tests. And if you’ve been reading recently, you may have come across a post from Neil Patel, where he poses the question (within a detailed post about his testing results in general), “Where did my lift go?”
Neil, it’s quite possible (even likely) that you’re not seeing the lift in sales or revenue from your test because it was never there in the first place. You may have unknowingly received a “false positive” in your test – known in statistics as a Type I error, the incorrect rejection of a true null hypothesis. That’s a mouthful, so I simply remember it as a false positive.
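To make that definition concrete, here’s a tiny simulation of my own (a sketch, not Neil’s data or any particular tool’s math): two pages with the exact same true conversion rate, compared with a simple two-proportion z-test. The 5% conversion rate and visitor counts are made-up numbers for illustration.

```python
# Illustrative sketch only: two pages with IDENTICAL true conversion rates,
# compared with a simple pooled two-proportion z-test. The rates and visitor
# counts are assumptions for illustration, not data from any real test.
import random
from math import erf, sqrt

def two_sided_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

true_rate, visitors = 0.05, 2000   # both pages truly convert at 5%
conv_a = sum(random.random() < true_rate for _ in range(visitors))
conv_b = sum(random.random() < true_rate for _ in range(visitors))

p = two_sided_p_value(conv_a, visitors, conv_b, visitors)
print(f"A: {conv_a}/{visitors}  B: {conv_b}/{visitors}  p-value: {p:.3f}")
# Run this a handful of times: roughly 1 run in 20 will show p < 0.05 even
# though the pages are identical. Acting on that run is a Type I error.
```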
False positives are insidious because they generally result in the experimenter taking action based on something that does not exist.
In the pharmaceutical business, you can imagine how much damage would result from companies acting on false positives during drug testing.
While perhaps not as economically far-reaching or emotionally damaging as giving patients false hope, acting on false positives in your web tests could, at the very least, put you in a sticky situation with your boss, senior leaders, or investors. Worse than that, it could turn you off testing altogether.
I personally cringe at the idea of getting false positives because I always expect to learn from tests. A false positive for Copy Hackers means we think we’ve learned something about our visitors – when in fact there was no learning. We end up going down a dirty rabbit hole as we try to apply that learning throughout our site and other marketing materials (e.g., emails).
On the other hand, a false negative is generally benign. It means you’ve missed the opportunity to take action on something real because it was not revealed as part of your test. You don’t take any action, so you’ve really lost nothing (unless you factor in opportunity cost).
Example Of A False Positive
Recently Joanna and I decided to run a simple two-way split test on the Copy Hackers home page.
Here is the default version of the home page hero section:
Our desired measure of conversion was clicks on the primary call-to-action (i.e., clicks on the big green button) by new visitors to the home page. But because there are many ways into the site from the home page, we created an alternate conversion metric: engagement. Engagement simply means that a visitor clicks any link on the home page. Think of it as the opposite of a visitor bounce.
We launched the test on November 11th, 2012, and for the first 2 days, we saw a lot of fluctuation in the performance of the two pages. Then the performance settled into a nice rhythm, until after 6 days, our testing tool declared a winner with 95% confidence. Knowing what we know about confidence levels, Joanna and I let the test run for another day, just to be sure – after which the tool reported a 23.8% lift (we have a winner!) at a confidence level of 99.6%, which we took to mean there was only a 0.4% chance of a false positive:
Were we excited? Stunned is more the word.
Why? Because our split test was an A/A test – not an A/B test. In other words, the tested variation was identical to the default page… to the pixel!
On occasion we’ll run an A/A test to validate that the results will turn out as we expect. And most of the time, there are no surprises. But not this time. With nearly 100 conversions per recipe and a week’s worth of data, an identical copy of the home page was declared a substantial winner over the default page.
There is virtually no way to predict such an outcome, partly because it happens only a small fraction of the time, but more importantly because most tests involve two or more genuinely different variations. In that situation, the only way to know for sure whether you’ve received a false positive is to let the test keep running well past the point at which a winner is declared.
But our experiment clearly illustrates that popular testing tools still have plenty of room for improvement.
Here is what happened when we let the A/A test continue to run:
As you can see above, on about day 12, the two conversion rates converged, and the lift disappeared completely.
How Does This Happen?
I am not a statistician, and the point of this post is not to teach you statistics (see a detailed explanation of the issue here).
Evan Miller, the author of the above-mentioned post, explains that accurately measuring significance requires that your sample size be fixed in advance of the experiment. But that’s not what happens when you run your tests. Instead, you let a test run until the tool proclaims that you have a significant difference. And to make that call on the fly, the tool must run repeated significance tests as the data accumulates – a repeated-testing procedure that is statistically flawed.
In fact, the more frequently the tool tests for significance as the test progresses, the more inaccurate the calculation becomes – and you end up with a far higher probability of seeing the dreaded false positive.
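Here’s a rough simulation of my own (a sketch of the effect Evan describes, not how any particular tool works internally). Both variations share the same true conversion rate; one set of simulated tests checks for significance after every batch of visitors, the other looks only once at a pre-set sample size. The conversion rate, batch size, traffic totals, and simple z-test threshold are all assumptions for illustration.

```python
# Illustrative sketch: how "peeking" at repeated significance tests inflates
# false positives on A/A data. All numbers below are assumptions.
import random
from math import sqrt

def significant(conv_a, n_a, conv_b, n_b, z_crit=1.96):
    """True if a pooled two-proportion z-test crosses ~95% significance."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return se > 0 and abs(conv_a / n_a - conv_b / n_b) / se > z_crit

def run_aa_test(true_rate=0.05, batch=200, max_visitors=10_000, peek=True):
    """Simulate one A/A test; return True if a 'winner' is ever declared."""
    conv_a = conv_b = n = 0
    while n < max_visitors:
        n += batch
        conv_a += sum(random.random() < true_rate for _ in range(batch))
        conv_b += sum(random.random() < true_rate for _ in range(batch))
        if peek and significant(conv_a, n, conv_b, n):
            return True                              # stopped early on a false positive
    return significant(conv_a, n, conv_b, n)         # single look at the end

trials = 500
peeking = sum(run_aa_test(peek=True) for _ in range(trials)) / trials
one_look = sum(run_aa_test(peek=False) for _ in range(trials)) / trials
print(f"False positives when peeking every batch: {peeking:.1%}")  # well above 5%
print(f"False positives with one fixed-size look: {one_look:.1%}") # close to 5%
```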
What Do You Do Now?
For starters, keep testing using your tool of choice!
Joanna and I have witnessed massive, real conversion gains on our clients’ websites – validated through multiple iterations and by reconciling conversion data with financial data. This risk does not, in our opinion, outweigh the amazing benefits of continual optimization.
But just knowing that false positives are a possible outcome will benefit you.
For example, knowing this may cause you to let a test run longer (i.e., beyond the point at which the tool tells you it’s okay to stop the test). Or armed with this information, you may decide to run a test multiple times longitudinally.
Our recommendation is to calculate the sample size (i.e., the number of visitors) required to accurately assess your test data – before you launch the test. Put another way, you’ll want to pre-determine the duration of your test based on the number of required visitors.
And to help, here is an excellent post by Noah Lorang at 37signals on how to calculate the desired sample size for your next test. If, like us, you’re concerned about getting a false positive, use Noah’s formula to arrive at the exact number of visitors who will need to enter your experiment before you can determine whether or not you have a statistically meaningful difference between your 2 (3, 4, 5, etc.) variations.
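As an illustration, here’s a short sketch using the standard two-proportion sample-size formula – the same general idea as Noah’s post, though not necessarily his exact arithmetic. The 95% confidence and 80% power defaults, and the baseline rate and lift in the example, are assumptions you’d replace with your own numbers.

```python
# A quick sample-size sketch using the standard two-proportion formula.
# The baseline rate, minimum detectable lift, and the 95%-confidence /
# 80%-power defaults below are illustrative assumptions; plug in your own.
from math import ceil

Z_ALPHA = 1.96  # two-sided 95% confidence
Z_BETA = 0.84   # 80% power

def visitors_per_variation(baseline_rate, min_detectable_lift):
    """Visitors needed in EACH variation to detect a relative lift."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + min_detectable_lift)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (Z_ALPHA + Z_BETA) ** 2 * variance / (p2 - p1) ** 2
    return ceil(n)

# Example: a 5% baseline conversion rate, hoping to detect a 20% relative lift
n = visitors_per_variation(0.05, 0.20)
print(f"Run the test until each variation has seen about {n:,} visitors.")
```

Decide on that visitor count before launch, let the test run until each variation reaches it, and only then check for a winner.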
For the toolmakers, we’d challenge you to solve this problem around confidence. Not statistical confidence, but the confidence people place in your tools to guide them on when to stop a test. Can you implement a new test set-up experience that will save people from making costly mistakes like acting on a false positive – even something as simple as a built-in sample size calculator? Given the similarity in how popular testing tools declare a winner, developing a new user experience could be a key differentiator for you.
~lance