
I Spent All Summer Running A/B Tests, and What I Learned Made Me Question the Whole Idea

Welcome to Tales from a Testing Newbie. I’m your host, Jen Havice.

I’m a copywriter.

Before the summer of 2014, I’d only run a couple of split-tests… and I desperately wanted to do more.

Why? Why did I, like so many marketers, want to dive into the deep end of conversion rate optimization? I think it’s because it’s sooo easy to get caught up in the promise that testing holds. The tales of 2xing your revenue. The tales of anger. The tales of gargantuan increases in revenue. The tales you forward to a coworker, or bookmark for later reference, or – in the really worthy cases – comment on and tweet out. (**cough, cough**)

You’ve probably seen – and questioned – some flashy test wins yourself, right? Like the 123.9% increase in clicks that Joanna discussed here, shown below:

[Screenshot: the 123.9% increase in clicks]

Sometimes we get a window into the big losers, like this one you may have seen on Copy Hackers. Here, Joanna ran 4 variations against the control for Metageek. She was testing to see if using buttons would outperform text links… but here’s how that worked out:

[Screenshot: Metageek's losing A/B test variations]

The upside of reliving “losing tests” is that we all get to benefit from a teachable moment someone else had to suffer through. Whether winners or losers, these case studies leave you – or, at least, leave me – with a sense that testing is THE ticket to actionable insights. Develop a hypothesis, craft at least one treatment, implement a test, and start rolling in the hard data.

Except That It Doesn’t Always Turn Out the Way You’d Hoped

So this summer, I worked with Joanna on her second-annual summer of split-testing, where we partnered with 13 pre-selected startups and tried to optimize their copy.

In the best of cases, running an effective A/B test means coordinating a lot of moving parts. In the crazier cases – like working with 13 startups at the same time – it means getting caught up, chewed up and spit out by those moving parts. No matter how well I thought I was prepared for problems that might crop up in our tests, I still found myself surprised by the process and all the challenges that came along with it.

Among the things I’d never expected:

  • People get really worried about SEO – so worried that they don’t even want to test removing or changing copy lest Google swoop in and punish them
  • Some people will let you change almost anything if it might bring in a lift
  • A lot of seemingly small shops are amazingly skilled in the ways of CRO and copywriting
  • A/A testing doesn’t just check that your code’s installed and tracking right; it can also give you a sort of map of the traffic blips and natural variance to expect before you ever run an A/B test (there’s a quick sketch of what I mean right after this list)
  • The free trial for VWO and Optimizely brings marketers in… but then they try to shut down the tests before the 30-day trial is up… because I guess it’s better to save $59 than it is to learn and grow???
  • It’s the rare case when someone actually hard-codes the winning treatment as their new control
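Here’s a rough sketch of what I mean by that A/A “map” point – purely illustrative, not something we ran during the project. It just simulates splitting identical traffic into two buckets with the same true conversion rate (the 500 visitors per arm and the 3% rate are made up) so you can see how much “lift” shows up from chance alone.

```python
# Simulate an A/A test: two identical variations, same true conversion rate.
# Any "lift" you see here is pure noise -- a useful baseline to keep in mind
# before you react to early A/B numbers. (Illustrative numbers only.)
import random

def simulate_aa_test(visitors_per_arm=500, true_rate=0.03, runs=10, seed=42):
    random.seed(seed)
    for run in range(1, runs + 1):
        conv_a = sum(random.random() < true_rate for _ in range(visitors_per_arm))
        conv_b = sum(random.random() < true_rate for _ in range(visitors_per_arm))
        rate_a = conv_a / visitors_per_arm
        rate_b = conv_b / visitors_per_arm
        lift = (rate_b - rate_a) / rate_a * 100 if rate_a else float("nan")
        print(f"run {run:2d}: A {rate_a:.2%}  B {rate_b:.2%}  phantom lift {lift:+.1f}%")

simulate_aa_test()
```

Run it and you’ll typically see double-digit “lifts” appear out of nowhere, with nobody changing a thing – which is exactly the roller coaster I describe further down.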

But I’ve left out the biggest shocker. Because it’s so big, it deserves its own crosshead and a funky moniker. I give you:

The Crap-tastic Traffic Conundrum

It’s next to impossible to run a meaningful test on a website that has low traffic, low-quality traffic and/or low conversions. No surprise, right?

Everybody knows that testing takes lots o’ good traffic… but testing platforms casually disregard that critical point. (Hello, Optimizely and VWO, I’m looking at you!) How on earth is a non-Amazon business supposed to “test everything” when you can barely test your home page headline and reach confidence?

[Photo: Peep Laja talks testing at CTA Conf]

This September, at Unbounce’s terrific CTA Conf, Peep Laja (whose ConversionXL intensive course I took early this year) recommended that we aim for 250 conversions per treatment before we shut down a test. Two HUNDRED fifty. I don’t know if that sounds like a tiny number to you, but for a little perspective: most startups would see the apocalypse before they’d see that test reach confidence.

I get it – makes total sense. You’re supposed to run your test to at least 95% confidence. To get there when you have low traffic and/or low conversions, your new treatment needs to knock it out of the park. As in, it needs to soar high over the Green Monster and land somewhere in the Atlantic Ocean. That’s the only way to reduce the amount of time a test will take to reach significance.
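To make that concrete, here’s a back-of-the-envelope sketch of the standard sample-size arithmetic those calculators run. It’s purely illustrative – not from our tests or from any testing platform – and the 3% baseline conversion rate and the lifts are made-up numbers.

```python
# Roughly how many visitors you need in EACH variation before a lift of a
# given size can reach ~95% confidence with 80% power. (Illustrative only.)
from statistics import NormalDist

def visitors_per_variation(baseline_rate, relative_lift,
                           significance=0.95, power=0.80):
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - (1 - significance) / 2)  # two-sided
    z_beta = NormalDist().inv_cdf(power)
    pooled = (p1 + p2) / 2
    top = (z_alpha * (2 * pooled * (1 - pooled)) ** 0.5
           + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return round(top / (p2 - p1) ** 2)

for lift in (0.05, 0.20, 0.75):  # 5%, 20%, 75% relative lift
    n = visitors_per_variation(baseline_rate=0.03, relative_lift=lift)
    print(f"Detecting a {lift:.0%} lift on a 3% baseline: ~{n:,} visitors per variation")
```

On a 3% baseline, a modest 5% relative lift wants something like 200,000 visitors per variation; a 75% lift gets you down near 1,200. That’s the whole crap-tastic traffic conundrum in three lines of output.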

Which means this: you cannot test everything. “Test everything” is a myth, and I’m convinced that anyone who says “test everything” has never conducted more than one test in their career. Most sites only have the traffic to test their single best possible treatment… but how do you know it’s the best before you even test it? The cards are really stacked against you.

[Screenshot: sample size calculator]

I’m about as far away from being a statistician as you can get, but even I could grasp what was happening with the sites that turned out to have less traffic than we’d originally been led to believe. Depending on the goal – such as clicking a sign-up button or watching a video – either:

  1. There weren’t enough conversions per variation, or
  2. The difference in conversion rates per variation was so small that the data wasn’t telling us much of anything… other than that our copy changes were far from making the kind of impact we’d anticipated.

Thanks to low traffic, we had to abandon testing on more than one site when our month of experimentation was up. I’m sure I’m not the first person to be frustrated by this. Small, growing businesses are super-screwed when it comes to testing.

A/B Testing Can Feel Like You’re Riding a Roller Coaster
After Eating One Too Many Chili Cheese Dogs

You need to monitor your tests regularly… but not too regularly. You’ve got to check in to know what’s happening with your tests and if they’ve reached significance. But I quickly learned that checking in daily can get reallllly frustrating.

That’s because upticks in conversion lift often disappear like smoke in the rain. One day you may feel like a rockstar seeing a 20% bump from your treatment… only to find, a few days later, that you’re on the losing end of that proposition. Your test is tanking.

Of course, the opposite can happen too. A loser can turn into a winner.

What I learned is not to react to early data. Stay objective until the testing tool calls a winner, and then double- and triple-check the platform’s math with calculators like these. Remember that 80% statistical significance isn’t enough to call a test definitively. If the result isn’t above 95% confidence with a barely-there margin of error, the test isn’t done yet.
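For the curious, here’s roughly the kind of math those calculators do under the hood – a bare-bones two-proportion check, purely for illustration, with invented visitor and conversion counts.

```python
# Bare-bones two-proportion z-test: is B's conversion rate really better than A's?
# (Invented numbers; double-check real tests with a proper calculator.)
from statistics import NormalDist

def ab_significance(visitors_a, conv_a, visitors_b, conv_b):
    p_a = conv_a / visitors_a
    p_b = conv_b / visitors_b
    pooled = (conv_a + conv_b) / (visitors_a + visitors_b)
    se = (pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b)) ** 0.5
    z = (p_b - p_a) / se
    confidence = 1 - 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    lift = (p_b - p_a) / p_a * 100
    return lift, confidence

lift, confidence = ab_significance(visitors_a=1200, conv_a=36,
                                   visitors_b=1180, conv_b=51)
print(f"Observed lift: {lift:+.1f}%  Confidence: {confidence:.1%}")
```

In this made-up example the treatment looks like it’s winning by more than 40%, but the confidence comes out in the low nineties – so by the 95% rule above, the test isn’t done yet.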

(To get a better understanding of why this is so important, take a look at Evan Miller’s post, How Not to Run an A/B Test.)

In the end, I had to let go of my attachment to the treatments we created. Some did well – others not so much. Which leads me to…

Why Toss When You Can Iterate?

Most of the treatments we developed were new headlines. Writing headlines is probably one of my favorite things to do. It can also be one of the toughest things to do well.

No surprise, I labored over the copy suggestions I made. Not only did I want to make Joanna proud (editor’s note: awww), but I wanted to help prove our hypothesis. Here are some that we came up with:

[Screenshot: Luzme headline treatment]

[Screenshot: Softorino headline treatment]

[Screenshot: Sprout headline treatment]

As far as I’m concerned, there’s no reason why a headline filled with personality that packs a punch can’t skyrocket conversions… especially when it’s competing against an oh-so-bland counterpart. Unfortunately, some of our tests didn’t pan out the way we’d hoped. As Joanna mentioned in her headlines vs. buttons post, we found our bolder variations trending up but never reaching a point where we could draw any sound conclusions. Instead of abandoning those tests, she had the idea to keep the same headlines but change the corresponding call-to-action/button copy to make more of an impact. In Dressipi’s case, the reformulated treatment created an enormous lift, and the Softorino one (shown above: Take the Suck…) saw a lift as well.

The first tests started out with a meh reaction from visitors… so we tweaked, iterated, and learned. If we had simply tossed out what wasn’t working, we would have lost not only a big win but an amazing new insight.

As a final thought here, it seems that there’s room for improvement re: the metrics we use to gauge a winning or losing treatment. A headline may be really great, but if the button is undesirable, then it looks like the headline isn’t working. But the headline is distinct from that button. So are these testing platforms really measuring success of an element… or are we settling, as marketers, for limited split-testing tools? I wonder.

Conclusions from a Summer Full of Split-Testing

It’s all well and good to write what you think is phenomenal copy. Finding out whether or not it actually gets the job done is even better. Testing tools help enable that.

But.

A/B testing isn’t the be-all and end-all. It has its limitations and its frustrations, and there’s a lot of room for improvement. So if you decide to slide down this particular rabbit hole, keep in mind what this newbie learned and won’t soon forget:

  • Patience is a virtue. Be prepared to wait, and wait… and wait some more before drawing conclusions.
  • Swing for the fences when your traffic is low. Consider making larger or bolder changes based on your hypothesis. When you’re Amazon, you can test small changes that yield minimal lift because you’ve got the numbers to get a result quickly, and even a small lift makes a considerable impact on the bottom line. For the rest of us? You might hate split-testing less if you test dramatically different designs.
  • Learnings come from wins and losses. It’s so easy to let our egos get in the way. Who wants a losing test? But if you can take what you’ve learned and improve on it, your business will be better for it.

Many thanks to Joanna for the opportunity to work with her. I’m still pinching myself.

About the author

Jen

Making more websites convert and writing kickass copy for small businesses. Creator of MakeMentionMedia.com.

  • Hi Jen,

    Don’t you think the 250-conversion advice Peep Laja gave is kind of nullified by your later screenshot from a sample size calculator? Everyone can clearly see there are several variables at play: the level of statistical significance required, statistical power, and minimum effect of interest (btw, the 75% you put in there is huge; most A/B testers would be happy with a couple of percentage points of increase!). No fixed number can be given as a recommendation, since it all depends on the interplay between these three. In fact, it’s exactly this kind of advice that’s responsible for so many of the illusory results in A/B testing, so I’m not completely sure why you decided to include it alongside a much better example.

    I’d say your conclusions are good, but the first one is a bit worrying. It makes it sound like you’d like to monitor the data as it gathers, and make decisions on the spot. This, however, is not a modus operandi supported by the tools you showcase earlier. All of them assume a fixed sample size test, meaning you are not allowed to take any action before the predetermined sample size is reached. If this is not followed, then the numbers mean absolutely nothing of value.

    I know, this sounds so limiting and so not what you’d be able to do in a real-world business environment where everyone is pushing you for the test results, since you are losing money while running the A/B tests, no matter if your test variation(s) is performing better or worse than control.

    You’d need something like AGILE A/B testing instead. This free white paper should be a good starting point: https://www.analytics-toolkit.com/whitepapers.php?paper=efficient-ab-testing-in-cro-agile-statistical-method if you are interested in running A/B tests with proper statistical design.

    Best

  • I’ve had this trouble before. I said to a client, “Yeah, we’ll run some split tests and see what converts best after I’ve written the copy.” Then I write it… and the “traffic guy” hasn’t really done his job. Sooooo split testing seems a little bit pointless. But of course the client doesn’t understand that… they still want the tests I promised. Lol, silly me for promising that in my proposal.

  • I love this article because you proved what I already knew: small sites that can’t afford to buy enough traffic to really test well can’t benefit much from testing. Those of us who work with really small businesses need to apply common sense and do the best we can with what they can afford.

    Back in the days when I did AdWords I could test ad creatives to my heart’s content. While often there were clear winners and losers, testing the exact same ad against itself and getting entirely different results is not good for confidence in testing. Obviously there was some variable that our tests were not controlling.

  • W. Szabó Péter

    Great article! I’m (also) quite sceptical about A/B testing, for the following 5 reasons: http://kaizen-ux.com/ab-testing-skepticism/

  • Stuart Glendinning Hall

    Or take the pragmatic approach and write great content, and post it where people are going to want to read it, as another option until split testing makes more commercial sense (like when it’s really worthwhile:-).

  • Oh boy, has this opened my eyes!

    After being convinced by all and sundry for a good while that split testing headlines (at least) for 100 visitors should give a good answer, I am now convinced that, certainly at my small-seller level, it is pretty much a waste of my time.

    Until I can guarantee thousands of visitors then I can pretty much lay the idea off to one side. Sure I can ask opinions on a headline before I start, but from now on, I’ll just write out a dozen and stick with ones that grab my attention. After all, I’m pretty much my ideal buyer IMHO.

    Thank you so much for this article. It’s one of those things that can just make your day when you read it.

    Regards,

    Steven Lucas


  • The farther you go with split testing, the more you realize its ineffectiveness.

    For example, in your article you mention increasing or decreasing conversions a lot. Go back through your split tests and see how those upticks and downticks in conversion translated to average order value or cost per acquisition.

    Very often an increase in conversions can lead to a decrease in average revenue per order, and a decrease in conversions can lead to an increase in earnings.

    The two are not tied together and it changes with every channel. What may convert well with your social channel could be bombing horribly with your paid media channel. #FunStuff

    • Justin, I think you’re right – a successful design is not just about improving conversion rate. Ultimately, greater revenue and profitability are the best measures of success.

      Of course, you can use split testing to optimise any metric, not just conversion rate. You could select pageviews, clicks, visits, average order value, CPA, revenue, ROI or profitability.

      Which you should select depends on your business goals, the purpose of the page you are testing and the volume and quality of data you have available.

      Most businesses want to boost revenue so if they can measure the revenue generated from the page then they can use a split test to optimise that instead of conversion rate.

  • Raquel Hirsch

    What a fabulous article. I laughed… I cried… all insights: true!

  • ShanaC

    So

    You’ve been discussed by one of the people that Evan Miller discusses testing methods with.

    1) You may want to switch to a Bayesian form for an AB test. http://www.evanmiller.org/bayesian-ab-testing.html

    http://www.bayesianwitch.com/blog/2014/bayesian_ab_test.html

    http://developers.lyst.com/data/2014/05/10/bayesian-ab-testing/

    http://developers.lyst.com/bayesian-calculator/

    In some cases it may let you skirt by with fewer conversions. You’d still need a platform to run the actual tests from, though. They are also very robust in an environment where you have an informed baseline to work off of (and that includes low-traffic environments).

    2) In low-traffic environments, you should make up a decision rule. Basically, unless it hurts to switch, think through intuitively why or why not, and come up with some rules based off your intuition – at some point you will see improvement you can work off of.

    3) Traffic first. It is way easier to test if you understand traffic and traffic quality. That $59 may be better spent on bringing in traffic than on testing, for a startup (:-/)

    4) People are overthinking the Google penalty. Google will over-reward if you increase traffic in legitimate ways, and legitimate uses of testing fall into this category. Properly implementing site testing tools so that your site doesn’t radically slow down will not lead to a Google penalty. Getting good-quality backlinks and driving traffic while running a test will help you, though!

    • Joanna Wiebe

      I’d love to learn more about the Bayesian style (??) of testing. I’m usually the person saying, “Test this copy” rather than the person saying, “We should run the test this way with that methodology and X, Y, Z.” I’ll be reading through your links and keeping my eyes open, Shana — thank you.

      I’m wondering… if a Bayesian test will let you run a test with fewer conversions, why don’t popular testing tools offer this approach… or do they, and I just haven’t noticed it? (Perhaps Google Experiments does, but I rarely go in there. GE and I don’t get along. 🙂 )

      • ShanaC

        Short answer.

        The Bayesian school of statistics might lead you to needing fewer conversions. It also might not. Where the needle isn’t being moved a lot, you still need lots of trials – but if the needle is being moved, you will be able to see it sooner and stop the test when you want. (Basically, you don’t have the false p-values problem happening.) (Or you could stop it and say the needle isn’t moving and this is pointless.)

        There is no straight-up prosumer Bayesian A/B testing tool on the market at the price point you are looking at. You also have to know about Bayesian statistics to even want to test this way. Furthermore, business schools and basic statistics courses don’t teach it either. (No demand.)

      • Doesn’t a Bayesian approach give you a different type of answer? Rather than saying there is only a 5% chance that the treatment result is a fluke, it gives you a % chance that the result is better.

        And unlike the traditional approach, you don’t need to decide how many results you need to get in advance.

        Which means that when your deadline is up you can go with the balance of probabilities, or decide to test for longer, rather than simply abandoning the test?

        More flexibility is probably better for startups. I’m sure Optimizely or VWO could build it in if they wanted to – it’s just not seen by them as something that will get them more customers, sadly.

      • ShanaC

        You need to have a threshold of caring in a Bayesian test. Bayesian tests do give a different type of answer, as you describe – but the answer is expressed as distributions of probabilities. The A arm and B arm could still overlap a lot after 100 trials. Or they may not. The goal is to get them to separate enough. With informed priors they won’t re-cross unless the test population changes, but if they do, you have the same problems as p-values.

      • Thanks, that makes sense.

        I guess I’m thinking that if you are forced by circumstances to make a decision at a certain point, you might as well go for the one with a better mean conversion rate, even if the difference is small, because on the balance of probabilities that will be the better decision.
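      • [Editor’s note] For anyone curious what the Beta-posterior comparison Shana describes might look like in practice, here’s a minimal sketch – purely illustrative, with invented conversion counts and flat Beta(1, 1) priors, not code from our tests or from any testing tool.

        ```python
        # Beta-Binomial Bayesian A/B comparison: estimate P(B's true rate > A's)
        # by sampling from each variation's posterior. (Illustrative numbers only.)
        import random

        def prob_b_beats_a(conv_a, visitors_a, conv_b, visitors_b,
                           prior_alpha=1, prior_beta=1, draws=100_000, seed=7):
            random.seed(seed)
            wins = 0
            for _ in range(draws):
                rate_a = random.betavariate(prior_alpha + conv_a,
                                            prior_beta + visitors_a - conv_a)
                rate_b = random.betavariate(prior_alpha + conv_b,
                                            prior_beta + visitors_b - conv_b)
                wins += rate_b > rate_a
            return wins / draws

        print(f"P(B beats A) ~ {prob_b_beats_a(36, 1200, 51, 1180):.1%}")
        ```

        With the same invented counts as the significance check earlier in the post, this spits out a single readable number – the probability that B’s true rate really is higher – instead of a p-value.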

  • Aaron Orendorff

    Stellar take-away right before the conclusion: “A headline may be really great, but if the button is undesirable, then it looks like the headline isn’t working. But the headline is distinct from that button.”

    What a holistic approach to CRO!

    And you’re totally right, how difficult that is to test.

    Huge kudos to you and Joanna. I remember reading that original post and thinking: “Genius!”

    Oh, and I had to laugh at this gem: “I guess it’s better to save $59 than it is to learn and grow???”

    Crazy.

    Really great article.

    • Many thanks! Jo added some spot-on editorial flourishes to my post which helped give it a little more zing.

      Keeping those original edgy headlines that weren’t quite doing us proud and then adding amped-up button copy – that was all her. That was a great lesson: don’t toss, iterate. Loved seeing that in action. She’s truly amazing at this whole copywriting thing.

    • Raquel Hirsch

      I tweeted that!

      • Aaron Orendorff

        Well, you know what they say about great minds … 😉

  • This post is so on point for my current company. We’re so early-stage that it’s hard to get significant traffic. The good news is that we’re testing big differences and we’re seeing relatively large differences in conversion rates so far. It’s still early though. Thanks for the great write-up and useful links throughout.

    • Glad you found this helpful! Yeah, you’re probably in that “swing for the fences” stage. You can only do what you can do – as they say. Good luck!

  • Raj

    I just saw the pop-up you were talking about in the newsletter, and honestly it’s a bit of what you’d call in-your-face. In fact, I had to close the pop-up because it was suddenly so bright that it gave me a 6-second headache. Literally.

    • Joanna Wiebe

      Yeah, it’s an experiment we’re running. We’ll see how it goes! 🙂

      • ShanaC

        How are you running the experiment? (There are a bunch of ways to run it, because it can tie into a bunch of metrics – some of which will go up, some of which will go down – and I’ve seen pop-ups go both ways.)

  • Fedja

    It really is that simple: if you don’t have the traffic, you can’t do solid testing. It’s the same in any survey scenario; I’m fascinated that people expect A/B tests to be any different from surveys.

    You can’t come to a conclusion from a survey filled out by 4 people, and you can’t decide if your product fits the market based on 2 reviews. Even when we look at Amazon or Yelp reviews, we instinctively disqualify the ratings aggregated from one or two reviews.

    As a sidenote from someone who has had the opportunity to play with A/B tests on a site that did have plenty of traffic: it’s just as important to understand the broader implications of your tests. I’ve had button and form tests lift conversion on a landing page by 45%, and when the same changes were applied to an identical page, nothing happened. The customer journey to that page was different, and the page had a much smaller “convincing potential” in that context.

    Test the right things, not everything. Sounds easy, but it took me months at that one company to make educated guesses; simply testing everything would have made me give up in frustration if I’d rushed into it.

    • Yes, just having traffic doesn’t necessarily solve all your problems either. You still need to have a solid plan and make sure you’re looking at the bigger picture. In the end, you may be getting more conversions but not making up the difference in revenue.

  • Excellent post and exceptionally sensible. As you correctly observe there is a risk of a knee-jerk reaction when dealing with low traffic volumes. This can oh-so-easily be reinforced when we’re hooked on checking Google Analytics every 20 minutes expecting to see something happen!

    I like to borrow from Nassim Nicholas Taleb in these situations: disengage from the data, let the experiment run and tap in every couple or three days to see what’s happening. You’ll have better data AND sleep better at night!!

    • That’s what I had to do – not look at the tests for at least 2 or 3 days at a time. Otherwise, I started to drive myself crazy.

  • Nice post, Jen! You have eloquently described much of the pain I’ve experienced helping companies optimize their websites.

    “Test everything” is just like “Don’t put a plastic bag over your head”. Oh, hang on…

    I think “test everything” has good intentions, but we know where good intentions often lead. 🙂

    • Many thanks. I didn’t realize quite how frustrating it could be. There are so many moving parts. I sent Joanna more than one email just asking, “Why is this happening?”

      And, yes… the road to Hell is paved with one too many good intentions. I’ve trundled down it myself a time or two.

  • Jonathan DeVore

    I love it when people write what I think (but I’m too nervous to say). Sometimes I don’t want to admit embarrassing facts like “I don’t have a lot of traffic” – so I don’t speak up. I just go with the flow.

    • Don’t feel bad. Most businesses don’t have the traffic. That’s why this testing stuff is so hard.

  • Cambridge SEO & Web

    This is a really great article Jen, we tweeted it on our channel. I remember A/B testing a while ago and finding the same issues – it’s really difficult to know if small traffic flows really yield good results, and you begin to doubt the whole process! Anyhow, thanks again, great read!

    • Many thanks! Joanna is a good editor too.
      It is so difficult. I had only done a couple of tests before, so I really had no solid experience to draw on. It was phenomenal to see so many tests running at the same time and get a feel for what you need traffic-wise. Even though it was a bit up and down, I’m excited to keep at it.

  • This is great stuff, Jen! Thanks for writing this up. Your point about swinging for the fences when testing low-traffic sites definitely gives me a lot to think about. I’ve heard “test everything” from countless people. And while that might work if you’re getting ridiculous traffic, with smaller sites we have to pick our battles. Thanks.

    • It’s so true. Between this great experience working with Joanna and learning from Peep, swinging for the fences is what you’ve got to do when the traffic is low. There’s so much that goes into all of this. “Test everything” is not a good answer.
