
I Spent All Summer Running A/B Tests, and What I Learned Made Me Question the Whole Idea

Welcome to Tales from a Testing Newbie. I’m your host, Jen Havice.

I’m a copywriter.

Before the summer of 2014, I’d only run a couple of split-tests… and I desperately wanted to do more.

Why? Why did I, like so many marketers, want to dive into the deep end of conversion rate optimization? I think it’s because it’s sooo easy to get caught up in the promise that testing holds. The tales of 2xing your revenue. The tales of anger. The tales of gargantuan increases in revenue. The tales you forward to a coworker, or bookmark for later reference, or – in the really worthy cases – comment on and tweet out. (**cough, cough**)

You’ve probably seen – and questioned – some flashy test wins yourself, right? Like the 123.9% increase in clicks that Joanna discussed here, shown below:

[Screenshot: the 123.9% increase in clicks]

Sometimes we get a window into the big losers, like this one you may have seen on Copy Hackers. Here, Joanna ran 4 variations against the control for Metageek. She was testing to see if using buttons would outperform text links… but here’s how that worked out:

[Screenshot: Metageek's losing A/B test variations]

The upside of reliving “losing tests” is that we all get to benefit from a teachable moment someone else had to suffer through. Whether winners or losers, these case studies leave you – or, at least, leave me – with a sense that testing is THE ticket to actionable insights. Develop a hypothesis, craft at least one treatment, implement a test, and start rolling in the hard data.

Except That It Doesn’t Always Turn Out the Way You’d Hoped

So this summer, I worked with Joanna on her second-annual summer of split-testing, where we partnered with 13 pre-selected startups and tried to optimize their copy.

In the best of cases, running an effective A/B test means coordinating a lot of moving parts. In the crazier cases – like working with 13 startups at the same time – it means getting caught up, chewed up and spit out by those moving parts. No matter how well I thought I was prepared for problems that might crop up in our tests, I still found myself surprised by the process and all the challenges that came along with it.

Among the things I’d never expected:

  • People get really worried about SEO – so worried that they don’t even want to test removing or changing copy lest Google swoop in and punish them
  • Some people will let you change almost anything if it might bring in a lift
  • A lot of seemingly small shops are amazingly skilled in the ways of CRO and copywriting
  • A/A testing doesn’t just check that your code’s installed and tracking right; it can also give you a sort of map of the traffic blips and natural variance to expect before you ever run an A/B test (there’s a quick sketch of what I mean right after this list)
  • The free trial for VWO and Optimizely brings marketers in… but then they try to shut down the tests before the 30-day trial is up… because I guess it’s better to save $59 than it is to learn and grow???
  • It’s the rare case when someone actually hard-codes the winning treatment as their new control
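Here’s a rough sketch of what I mean by that A/A “map” point – purely illustrative, not something we ran during the project. It just simulates splitting identical traffic into two buckets with the same true conversion rate (the 500 visitors per arm and the 3% rate are made up) so you can see how much “lift” shows up from chance alone.

```python
# Simulate an A/A test: two identical variations, same true conversion rate.
# Any "lift" you see here is pure noise -- a useful baseline to keep in mind
# before you react to early A/B numbers. (Illustrative numbers only.)
import random

def simulate_aa_test(visitors_per_arm=500, true_rate=0.03, runs=10, seed=42):
    random.seed(seed)
    for run in range(1, runs + 1):
        conv_a = sum(random.random() < true_rate for _ in range(visitors_per_arm))
        conv_b = sum(random.random() < true_rate for _ in range(visitors_per_arm))
        rate_a = conv_a / visitors_per_arm
        rate_b = conv_b / visitors_per_arm
        lift = (rate_b - rate_a) / rate_a * 100 if rate_a else float("nan")
        print(f"run {run:2d}: A {rate_a:.2%}  B {rate_b:.2%}  phantom lift {lift:+.1f}%")

simulate_aa_test()
```

Run it and you’ll typically see double-digit “lifts” appear out of nowhere, with nobody changing a thing – which is exactly the roller coaster I describe further down.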

But I’ve left out the biggest shocker. Because it’s so big, it deserves its own crosshead and a funky moniker. I give you:

The Crap-tastic Traffic Conundrum

It’s next to impossible to run a meaningful test on a website that has low traffic, low-quality traffic and/or low conversions. No surprise, right?

Everybody knows that testing takes lots o’ good traffic… but testing platforms casually disregard that critical point. (Hello, Optimizely and VWO, I’m looking at you!) How on earth is a non-Amazon business supposed to “test everything” when you can barely test your home page headline and reach confidence?

[Photo: Peep Laja talks testing at CTA Conf]

This September, at Unbounce’s terrific CTA Conf, Peep Laja (whose ConversionXL intensive course I took early this year) recommended that we aim for 250 conversions per treatment before we shut down a test. Two HUNDRED fifty. I don’t know if that sounds like a tiny number to you, but for a little perspective: most startups would see the apocalypse before they’d see that test reach confidence.

I get it – makes total sense. You’re supposed to run your test to at least 95% confidence. To get there when you have low traffic and/or low conversions, your new treatment needs to knock it out of the park. As in, it needs to soar high over the Green Monster and land somewhere in the Atlantic Ocean. That’s the only way to reduce the amount of time a test will take to reach significance.
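To make that concrete, here’s a back-of-the-envelope sketch of the standard sample-size arithmetic those calculators run. It’s purely illustrative – not from our tests or from any testing platform – and the 3% baseline conversion rate and the lifts are made-up numbers.

```python
# Roughly how many visitors you need in EACH variation before a lift of a
# given size can reach ~95% confidence with 80% power. (Illustrative only.)
from statistics import NormalDist

def visitors_per_variation(baseline_rate, relative_lift,
                           significance=0.95, power=0.80):
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - (1 - significance) / 2)  # two-sided
    z_beta = NormalDist().inv_cdf(power)
    pooled = (p1 + p2) / 2
    top = (z_alpha * (2 * pooled * (1 - pooled)) ** 0.5
           + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return round(top / (p2 - p1) ** 2)

for lift in (0.05, 0.20, 0.75):  # 5%, 20%, 75% relative lift
    n = visitors_per_variation(baseline_rate=0.03, relative_lift=lift)
    print(f"Detecting a {lift:.0%} lift on a 3% baseline: ~{n:,} visitors per variation")
```

On a 3% baseline, a modest 5% relative lift wants something like 200,000 visitors per variation; a 75% lift gets you down near 1,200. That’s the whole crap-tastic traffic conundrum in three lines of output.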

Which means this: you cannot test everything. “Test everything” is a myth, and I’m convinced that anyone who says “test everything” has never conducted more than one test in their career. Most sites only have the traffic to test their single best possible treatment… but how do you know it’s the best before you even test it? The cards are really stacked against you.

[Screenshot: sample size calculator]

I’m about as far away from being a statistician as you can get, but even I could grasp what was happening with the sites that turned out to have less traffic than we’d originally been led to believe. Depending on the goal – such as clicking a sign-up button or watching a video – either:

  1. There weren’t enough conversions per variation, or
  2. The difference in conversion rates per variation was so small that the data wasn’t telling us much of anything… other than that our copy changes were far from making the kind of impact we’d anticipated.

Thanks to low traffic, we had to abandon testing on more than one site when our month of experimentation was up. I’m sure I’m not the first person to be frustrated by this. Small, growing businesses are super-screwed when it comes to testing.

A/B Testing Can Feel Like You’re Riding a Roller Coaster
After Eating One Too Many Chili Cheese Dogs

You need to monitor your tests regularly… but not too regularly. You’ve got to check in to know what’s happening with your tests and if they’ve reached significance. But I quickly learned that checking in daily can get reallllly frustrating.

That’s because upticks in conversion lift often disappear like smoke in the rain. One day you may feel like a rockstar seeing a 20% bump from your treatment… only to find, a few days later, that you’re on the losing end of that proposition. Your test is tanking.

Of course, the opposite can happen too. A loser can turn into a winner.

What I learned is not to react to early data. Stay objective until the testing tool calls a winner, and then double- and triple-check the platform’s math with calculators like these. Remember that 80% statistical significance isn’t enough to call a test definitively. If the result isn’t above 95% confidence with a barely-there margin of error, the test isn’t done yet.
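For the curious, here’s roughly the kind of math those calculators do under the hood – a bare-bones two-proportion check, purely for illustration, with invented visitor and conversion counts.

```python
# Bare-bones two-proportion z-test: is B's conversion rate really better than A's?
# (Invented numbers; double-check real tests with a proper calculator.)
from statistics import NormalDist

def ab_significance(visitors_a, conv_a, visitors_b, conv_b):
    p_a = conv_a / visitors_a
    p_b = conv_b / visitors_b
    pooled = (conv_a + conv_b) / (visitors_a + visitors_b)
    se = (pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b)) ** 0.5
    z = (p_b - p_a) / se
    confidence = 1 - 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    lift = (p_b - p_a) / p_a * 100
    return lift, confidence

lift, confidence = ab_significance(visitors_a=1200, conv_a=36,
                                   visitors_b=1180, conv_b=51)
print(f"Observed lift: {lift:+.1f}%  Confidence: {confidence:.1%}")
```

In this made-up example the treatment looks like it’s winning by more than 40%, but the confidence comes out in the low nineties – so by the 95% rule above, the test isn’t done yet.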

(To get a better understanding of why this is so important, take a look at Evan Miller’s post, How Not to Run an A/B Test.)

In the end, I had to let go of my attachment to the treatments we created. Some did well – others not so much. Which leads me to…

Why Toss When You Can Iterate?

Most of the treatments we developed were new headlines. Writing headlines is probably one of my favorite things to do. It can also be one of the toughest things to do well.

No surprise, I labored over the copy suggestions I made. Not only did I want to make Joanna proud (editor’s note: awww), but I wanted to help prove our hypothesis. Here are some that we came up with:

[Screenshot: Luzme headline treatment]

[Screenshot: Softorino headline treatment]

[Screenshot: Sprout headline treatment]

As far as I’m concerned, there’s no reason why a headline filled with personality that packs a punch can’t skyrocket conversions… especially when it’s competing against an oh-so-bland counterpart. Unfortunately, some of our tests didn’t pan out the way we’d hoped. As Joanna mentioned in her headlines vs. buttons post, we found our bolder variations trending up but never reaching a point where we could draw any sound conclusions. Instead of abandoning those tests, she had the idea to keep the same headlines but change the corresponding call-to-action/button copy to make more of an impact. In Dressipi’s case, the reformulated treatment created an enormous lift, and the Softorino one (shown above: Take the Suck…) saw a lift as well.

The first tests started out with a meh reaction from visitors… so we tweaked, iterated, and learned. If we had simply tossed out what wasn’t working, we would have lost not only a big win but an amazing new insight.

As a final thought here, it seems that there’s room for improvement re: the metrics we use to gauge a winning or losing treatment. A headline may be really great, but if the button is undesirable, then it looks like the headline isn’t working. But the headline is distinct from that button. So are these testing platforms really measuring success of an element… or are we settling, as marketers, for limited split-testing tools? I wonder.

Conclusions from a Summer Full of Split-Testing

It’s all well and good to write what you think is phenomenal copy. Finding out whether or not it actually gets the job done is even better. Testing tools help enable that.

But.

A/B testing isn’t the be-all and end-all. It has its limitations and its frustrations, and there’s a lot of room for improvement. So if you decide to slide down this particular rabbit hole, keep in mind what this newbie learned and won’t soon forget:

  • Patience is a virtue. Be prepared to wait, and wait… and wait some more before drawing conclusions.
  • Swing for the fences when your traffic is low. Consider making larger or bolder changes based on your hypothesis. When you’re Amazon, you can test small changes that yield minimal lift because you’ve got the numbers to get a result quickly, and even a small lift makes a considerable impact on the bottom line. For the rest of us? You might hate split-testing less if you test dramatically different designs.
  • Learnings come from wins and losses. It’s so easy to let our egos get in the way. Who wants a losing test? But if you can take what you’ve learned and improve on it, your business will be better for it.

Many thanks to Joanna for the opportunity to work with her. I’m still pinching myself.

About the author

Jen

Making more websites convert and writing kickass copy for small businesses. Creator of MakeMentionMedia.com.

  • Hi Jen,

    Don’t you think the 250-conversion advice Peep Laja gave is kind of nullified by your later screenshot from a sample size calculator? Everyone can clearly see there are several variables at play: the level of statistical significance required, statistical power, and minimum effect of interest (btw, the 75% you put in there is huge; most A/B testers would be happy with a couple of percentage points of increase!). No fixed number can be given as a recommendation, since it all depends on the interplay between these three. In fact, it’s exactly this kind of advice that’s responsible for so many of the illusory results in A/B testing, so I’m not completely sure why you decided to include it alongside a much better example.

    I’d say your conclusions are good, but the first one is a bit worrying. It makes it sound like you’d like to monitor the data as it gathers, and make decisions on the spot. This, however, is not a modus operandi supported by the tools you showcase earlier. All of them assume a fixed sample size test, meaning you are not allowed to take any action before the predetermined sample size is reached. If this is not followed, then the numbers mean absolutely nothing of value.

    I know, this sounds so limiting and so not what you’d be able to do in a real-world business environment where everyone is pushing you for the test results, since you are losing money while running the A/B tests, no matter if your test variation(s) is performing better or worse than control.

    You’d need something like AGILE A/B testing instead. This free white paper should be a good starting point: https://www.analytics-toolkit.com/whitepapers.php?paper=efficient-ab-testing-in-cro-agile-statistical-method if you are interested in running A/B tests with proper statistical design.

    Best

  • I’ve had this trouble before. I said to a client, “Yeah, we’ll run some split tests and see what converts best after I’ve written the copy.” Then I write it… and the “traffic guy” hasn’t really done his job. Sooooo split testing seems a little bit pointless. But of course the client doesn’t understand that… they still want the tests I promised. Lol, silly me for promising that in my proposal.

  • I love this article because you proved what I already knew: small sites that can’t afford to buy enough traffic to really test well can’t benefit much from testing. Those of us who work with really small businesses need to apply common sense and do the best we can with what they can afford.

    Back in the days when I did AdWords I could test ad creatives to my heart’s content. While often there were clear winners and losers, testing the exact same ad against itself and getting entirely different results is not good for confidence in testing. Obviously there was some variable that our tests were not controlling.

  • W. Szabó Péter

    Great article! I’m (also) quite sceptical about A/B testing, for the following 5 reasons: http://kaizen-ux.com/ab-testing-skepticism/

  • Stuart Glendinning Hall

    Or take the pragmatic approach and write great content, and post it where people are going to want to read it, as another option until split testing makes more commercial sense (like when it’s really worthwhile:-).

  • Oh boy, has this opened my eyes!

    After being convinced by all and sundry for a good while that split testing headlines (at least) for 100 visitors should give a good answer, I am now convinced that, certainly at my small-seller level, it is pretty much a waste of my time.

    Until I can guarantee thousands of visitors then I can pretty much lay the idea off to one side. Sure I can ask opinions on a headline before I start, but from now on, I’ll just write out a dozen and stick with ones that grab my attention. After all, I’m pretty much my ideal buyer IMHO.

    Thank you so much for this article. It’s one of those things that can just make your day when you read it.

    Regards,

    Steven Lucas


  • The farther you go with split testing, the more you realize its ineffectiveness.

    For example, in your article you mention increasing or decreasing conversions a lot. Go back through your split tests and see how those upticks and downticks in conversion translated to average order value or cost per acquisition.

    Very often an increase in conversions can lead to a decrease in average revenue per order, and a decrease in conversions can lead to an increase in earnings.

    The two are not tied together and it changes with every channel. What may convert well with your social channel could be bombing horribly with your paid media channel. #FunStuff

    • Justin, I think you’re right – a successful design is not just about improving conversion rate. Ultimately, greater revenue and profitability are the best measures of success.

      Of course, you can use split testing to optimise any metric, not just conversion rate. You could select pageviews, clicks, visits, average order value, CPA, revenue, ROI or profitability.

      Which you should select depends on your business goals, the purpose of the page you are testing and the volume and quality of data you have available.

      Most businesses want to boost revenue so if they can measure the revenue generated from the page then they can use a split test to optimise that instead of conversion rate.

  • Raquel Hirsch

    What a fabulous article. I laughed… I cried… all insights: true!

  • ShanaC

    So

    You’ve been discussed by one of the people that Evan Miller discusses testing methods with.

    1) You may want to switch to a Bayesian form for an AB test. http://www.evanmiller.org/bayesian-ab-testing.html

    http://www.bayesianwitch.com/blog/2014/bayesian_ab_test.html

    http://developers.lyst.com/data/2014/05/10/bayesian-ab-testing/

    http://developers.lyst.com/bayesian-calculator/

    In some cases it may let you skirt by with fewer conversions. You’d still need a platform to run the actual tests from, though. They are also very robust in an environment where you have an informed baseline to work off of (and that includes low-traffic environments).

    2) In low-traffic environments, you should make up a decision rule. Basically, unless it hurts to switch, think through intuitively why or why not, and come up with some rules based off your intuition – at some point you will see improvement you can work off of.

    3) Traffic first. It is way easier to test if you understand traffic and traffic quality. That $59 may be better spent on bringing in traffic than on testing, for a startup (:-/)

    4) People are overthinking the Google penalty. Google will over-reward if you increase traffic in legitimate ways, and legitimate uses of testing fall into this category. Properly implementing site testing tools so that your site doesn’t radically slow down will not lead to a Google penalty. Getting good-quality backlinks and driving traffic while running a test will help you, though!

    • Joanna Wiebe

      I’d love to learn more about the Bayesian style (??) of testing. I’m usually the person saying, “Test this copy” rather than the person saying, “We should run the test this way with that methodology and X, Y, Z.” I’ll be reading through your links and keeping my eyes open, Shana — thank you.

      I’m wondering… if a Bayesian test will let you run a test with fewer conversions, why don’t popular testing tools offer this approach… or do they, and I just haven’t noticed it? (Perhaps Google Experiments does, but I rarely go in there. GE and I don’t get along. 🙂 )

      • ShanaC

        Short answer.

        The Bayesian school of statistics might lead you to needing fewer conversions. It also might not. Where the needle isn’t being moved a lot, you still need lots of trials – but if the needle is being moved, you will be able to see it sooner and stop the test when you want. (Basically, you don’t have the false p-values problem happening.) (Or you could stop it and say the needle isn’t moving and this is pointless.)

        There is no straight-up prosumer Bayesian A/B testing tool on the market at the price point you are looking at. You also have to know about Bayesian statistics to even want to test this way. Furthermore, business schools and basic statistics courses don’t teach it either. (No demand.)

      • Doesn’t a Bayesian approach give you a different type of answer? Rather than saying there is only a 5% chance that the treatment result is a fluke, it gives you a % chance that the result is better.

        And unlike the traditional approach, you don’t need to decide how many results you need to get in advance.

        Which means that when your deadline is up you can go with the balance of probabilities, or decide to test for longer, rather than simply abandoning the test?

        More flexibility is probably better for startups. I’m sure Optimizely or VWO could build it in if they wanted to – it’s just not seen by them as something that will get them more customers, sadly.

      • ShanaC

        You need to have a threshold of caring in a Bayesian test. Bayesian tests do give a different type of answer, as you describe – but the answer is expressed as distributions of probabilities. The A arm and B arm could still overlap a lot after 100 trials. Or they may not. The goal is to get them to separate enough. With informed priors they won’t re-cross unless the test population changes, but if they do, you have the same problems as p-values.

      • Thanks, that makes sense.

        I guess I’m thinking that if you are forced by circumstances to make a decision at a certain point, you might as well go for the one with a better mean conversion rate, even if the difference is small, because on the balance of probabilities that will be the better decision.
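      • [Editor’s note] For anyone curious what the Beta-posterior comparison Shana describes might look like in practice, here’s a minimal sketch – purely illustrative, with invented conversion counts and flat Beta(1, 1) priors, not code from our tests or from any testing tool.

        ```python
        # Beta-Binomial Bayesian A/B comparison: estimate P(B's true rate > A's)
        # by sampling from each variation's posterior. (Illustrative numbers only.)
        import random

        def prob_b_beats_a(conv_a, visitors_a, conv_b, visitors_b,
                           prior_alpha=1, prior_beta=1, draws=100_000, seed=7):
            random.seed(seed)
            wins = 0
            for _ in range(draws):
                rate_a = random.betavariate(prior_alpha + conv_a,
                                            prior_beta + visitors_a - conv_a)
                rate_b = random.betavariate(prior_alpha + conv_b,
                                            prior_beta + visitors_b - conv_b)
                wins += rate_b > rate_a
            return wins / draws

        print(f"P(B beats A) ~ {prob_b_beats_a(36, 1200, 51, 1180):.1%}")
        ```

        With the same invented counts as the significance check earlier in the post, this spits out a single readable number – the probability that B’s true rate really is higher – instead of a p-value.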

  • Aaron Orendorff

    Stellar take-away right before the conclusion: “A headline may be really great, but if the button is undesirable, then it looks like the headline isn’t working. But the headline is distinct from that button.”

    What a holistic approach to CRO!

    And you’re totally right, how difficult that is to test.

    Huge kudos to you and Joanna. I remember reading that original post and thinking: “Genius!”

    Oh, and I had to laugh at this gem: “I guess it’s better to save $59 than it is to learn and grow???”

    Crazy.

    Really great article.

    • Many thanks! Jo added some spot-on editorial flourishes to my post which helped give it a little more zing.

      Keeping those original edgy headlines that weren’t quite doing us proud and then adding amped-up button copy – that was all her. That was a great lesson: don’t toss, iterate. Loved seeing that in action. She’s truly amazing at this whole copywriting thing.

    • Raquel Hirsch

      I tweeted that!

      • Aaron Orendorff

        Well, you know what they say about great minds … 😉

  • This post is so on point for my current company. We’re so early-stage that it’s hard to get significant traffic. The good news is that we’re testing big differences and we’re seeing relatively large differences in conversion rates so far. It’s still early though. Thanks for the great write-up and useful links throughout.

    • Glad you found this helpful! Yeah, you’re probably in that “swing for the fences” stage. You can only do what you can do – as they say. Good luck!

  • Raj

    I just saw the pop-up you were talking about in the newsletter, and honestly it’s a bit of what you’d call in-your-face. In fact, I had to close the pop-up because it was suddenly so bright that it gave me a 6-second headache. Literally.

    • Joanna Wiebe

      Yeah, it’s an experiment we’re running. We’ll see how it goes! 🙂

      • ShanaC

        How are you running the experiment? (There are a bunch of ways to run it, because it can tie into a bunch of metrics – some of which will go up, some of which will go down – and I’ve seen pop-ups go both ways.)

  • Fedja

    It really is that simple: if you don’t have the traffic, you can’t do solid testing. It’s the same in any survey scenario; I’m fascinated that people expect A/B tests to be any different from surveys.

    You can’t come to a conclusion from a survey filled out by 4 people, and you can’t decide if your product fits the market based on 2 reviews. Even when we look at Amazon or Yelp reviews, we instinctively disqualify the ratings aggregated from one or two reviews.

    As a sidenote from someone who has had the opportunity to play with A/B tests on a site that did have plenty of traffic: it’s just as important to understand the broader implications of your tests. I’ve had button and form tests lift conversion on a landing page by 45%, and when the same changes were applied to an identical page, nothing happened. The customer journey to that page was different, and the page had a much smaller “convincing potential” in that context.

    Test the right things, not everything. Sounds easy, but it took me months at that one company to make educated guesses; simply testing everything would have made me give up in frustration if I’d rushed into it.

    • Yes, just having traffic doesn’t necessarily solve all your problems either. You still need to have a solid plan and make sure you’re looking at the bigger picture. In the end, you may be getting more conversions but not making up the difference in revenue.

  • Excellent post and exceptionally sensible. As you correctly observe there is a risk of a knee-jerk reaction when dealing with low traffic volumes. This can oh-so-easily be reinforced when we’re hooked on checking Google Analytics every 20 minutes expecting to see something happen!

    I like to borrow from Nassim Nicholas Taleb in these situations: disengage from the data, let the experiment run and tap in every couple or three days to see what’s happening. You’ll have better data AND sleep better at night!!

    • That’s what I had to do – not look at the tests for at least 2 or 3 days at a time. Otherwise, I started to drive myself crazy.

  • Nice post, Jen! You have eloquently described much of the pain I’ve experienced helping companies optimize their websites.

    “Test everything” is just like “Don’t put a plastic bag over your head”. Oh, hang on…

    I think “test everything” has good intentions, but we know where good intentions often lead. 🙂

    • Many thanks. I didn’t realize quite how frustrating it could be. There are so many moving parts. I sent Joanna more than one email just asking, “Why is this happening?”

      And, yes… the road to Hell is paved with one too many good intentions. I’ve trundled down it myself a time or two.

  • Jonathan DeVore

    I love it when people write what I think (but I’m too nervous to say). Sometimes I don’t want to admit embarrassing facts like “I don’t have a lot of traffic” – so I don’t speak up. I just go with the flow.

    • Don’t feel bad. Most businesses don’t have the traffic. That’s why this testing stuff is so hard.

  • Cambridge SEO & Web

    This is a really great article Jen, we tweeted it on our channel. I remember A/B testing a while ago and finding the same issues – it’s really difficult to know if small traffic flows really yield good results, and you begin to doubt the whole process! Anyhow, thanks again, great read!

    • Many thanks! Joanna is a good editor too.
      It is so difficult. I had only done a couple of tests before, so I really had no solid experience to draw on. It was phenomenal to see so many tests running at the same time and get a feel for what you need traffic-wise. Even though it was a bit up and down, I’m excited to keep at it.

  • This is great stuff, Jen! Thanks for writing this up. Your point about swinging for the fences when testing low-traffic sites definitely gives me a lot to think about. I’ve heard “test everything” from countless people. And while that might work if you’re getting ridiculous traffic, with smaller sites we have to pick our battles. Thanks.

    • It’s so true. Between this great experience working with Joanna and learning from Peep, swinging for the fences is what you’ve got to do when the traffic is low. There’s so much that goes into all of this. “Test everything” is not a good answer.
