With the proliferation of A/B testing tools has come an era of experimentation. This is fantastic in the sense that we are embracing the need to improve and iterate our online businesses through experimentation. However, with it has come a wave of tactical, low-level testing (need I mention button colours?) where A/B tests are thrown to the wind and placed live without reaching significance. With that comes an assumption that A/B testing is easy. Please let me assure you it is not.
We've put together a list of 10 questions to ask yourself to see if you're A/B testing properly based on mistakes we see again and again.
What is "proper" is subjective but we certainly adhere to these guidelines and recommend that you do too.
Are you validating your hypothesis?
This should be a no-brainer, but you'd be surprised at how many times we hear "I think this will work" or "I saw this on this site and liked it" or "Why don't we try...". These are all well and good if they are pre-validated but the truth is, for all those cowboys out there, they generally are not. By 'validated' I mean understanding the impact that the test could have and / or whether it's a worthy test to run.
A1: Good A/B tests are researched well, have a clear hypothesis and are documented before and after. #optichat
— Hiten Shah (@hnshah) December 1, 2015
Let's take a recent example. We had a client that suggested changing the default within the 'sort by' dropdown on the category listing page. Firstly, it's a worthy hypothesis because it's not just a usability improvement: it affects the behaviour of the user and even their perception. Secondly, we need to validate such a claim by asking how many people use this sort by function. Specifically, how many use it to switch to the recommended element (in this case price), which would denote some 'need' for that element? What about the impact on those users that do change the sort by function to the recommended element (price)? Are they more inclined to purchase than with the previous element (popularity)? What is their AOV and how could it affect the bottom line? Is the exit rate higher for these users or lower? Is there any qualitative insight from the voice of the customer to suggest that this is a worthy experiment? Do the answers to these questions differ across devices or user types? These are the types of questions to ask when trying to validate a hypothesis.
In the example above, we found that the control default for the sort by function was product popularity. Users changed the default listing with the sort by feature 1% of the time, and changed it specifically to price low to high 0.5% of the time. Of those users that selected this element, their propensity to purchase was 25% lower. This was supported by the qualitative analysis, which also identified that price was the least influential factor in the users' decision-making process.
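As a rough sketch of that validation arithmetic, the snippet below turns the percentages into session counts. Only the 1%, 0.5% and 25% figures come from the example; the monthly session count and the baseline conversion rate are hypothetical placeholders, and the function name is ours for illustration.

```python
# Sizing the 'sort by' hypothesis from the figures quoted in the example.
# ASSUMPTION: monthly_sessions and baseline_cr are hypothetical placeholders.

def size_hypothesis(monthly_sessions, sort_usage, price_usage,
                    baseline_cr, relative_propensity):
    """Return how many sessions the behaviour touches and how they convert."""
    return {
        "sort_sessions": round(monthly_sessions * sort_usage),
        "price_sessions": round(monthly_sessions * price_usage),
        # Conversion rate of price sorters relative to the baseline
        "price_cr": baseline_cr * relative_propensity,
    }

result = size_hypothesis(
    monthly_sessions=100_000,   # hypothetical
    sort_usage=0.01,            # 1% use the sort by feature
    price_usage=0.005,          # 0.5% switch to price low to high
    baseline_cr=0.02,           # hypothetical
    relative_propensity=0.75,   # 25% less likely to purchase
)
print(result)
```

Seeing how few sessions the behaviour touches is usually enough to decide whether the experiment deserves a high or low priority.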
Don't get me wrong - we're not trying to answer a hypothesis through data. We could source all the data and insight in the world to validate a hypothesis but that's all it ever will be - data and, as such, an educated assumption before putting it into practice. No, what we're actually trying to do is ask "OK, is this experiment worth running? If so, how should we prioritise this experiment? Is it a high priority - based on your validation data - or a low priority?" As a quick checklist, ask yourself these questions when validating your hypothesis.
Summary: All hypotheses should be validated by some level of data to prove their worth and prioritise them accordingly
Have you researched your hypothesis?
That all being said, before you even get to that stage of validation, you should be doing research before you arrive at your hypothesis. In the example given above, theoretically, we shouldn't have reached the suggested change of testing the default within the 'sort by' dropdown on the category listing page the way we did; that experiment idea should have been born from conversion research.
"If you ask “what should we test next”, you have no idea what you’re doing. Do conversion research!"
Peep Laja has a nice article on how to come up with more winning tests using data, in which he explains the steps involved in their Research XL framework. This is not too dissimilar to our own User Conversion framework (most optimisation foundations are the same) where we combine quantitative data, qualitative insight and past experience within our discovery period to indicate what experiments should be prioritised. Only through effective conversion research will you understand what experiments should be run.
From here, naturally, you'll see patterns and themes. These themes are what affect our users' motivation, and experiments should be assigned to each theme. By assigning them to a theme, the success and / or failure of an experiment can be attributed to that theme, and the themes can be prioritised in terms of success rate. For example, only through effective user research do we understand that a key motivator for users of a health and beauty client is "convenience". From here, we have hypothesised and validated various experiments under this theme and tested their impact and success rate, which is the highest of any theme.
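Ranking themes by success rate is simple bookkeeping. Here is a minimal sketch, assuming each finished experiment is logged as a theme plus a win/lose flag. The log is invented for illustration, and only "convenience" is a theme taken from the text.

```python
from collections import defaultdict

def success_rate_by_theme(experiments):
    """experiments: iterable of (theme, won) pairs.
    Returns (theme, win_rate) tuples sorted from best to worst."""
    totals = defaultdict(lambda: [0, 0])  # theme -> [wins, runs]
    for theme, won in experiments:
        totals[theme][1] += 1
        if won:
            totals[theme][0] += 1
    rates = {theme: wins / runs for theme, (wins, runs) in totals.items()}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)

# Hypothetical experiment log
log = [
    ("convenience", True), ("convenience", True), ("convenience", False),
    ("trust", True), ("trust", False),
    ("urgency", False),
]
ranking = success_rate_by_theme(log)
for theme, rate in ranking:
    print(f"{theme}: {rate:.0%}")
```

With only a handful of experiments per theme the rates are noisy, so treat the ranking as a prioritisation aid rather than proof.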
Summary: Hypotheses without research are just hunches and guesswork, which will lead to limited results
Are you testing perception or behaviour?
In a recent blog post, we asked whether you're testing perception or behaviour. Because if you're not, you're not testing.
Most tend to test tactical elements that would only affect usability. As optimisers, we don't need to test 'everything' just for the sake of testing. In fact, we need to be ruthless about prioritising our experiments and improvements to maximise return on investment. If you're testing so-called subjective usability improvements, ask yourself why. Is it because something is broken and needs fixing? If so, just fix it.
There are, arguably, those experiments that can be classed under, what Peep Laja from Conversion XL calls, JFDI ("Just Do It") (with perhaps an expletive in there). This classification is for those experiments that are, what he calls, 'no brainers'. Now, we don't really believe in no-brainers where we are, but we do believe in usability issues or bug fixes. The two are slightly different.
Testing really comes into its own when we mould and affect user behaviour or user perception. That's where we get our large gains. When we test tactically and only improve usability, we'll see smaller returns. That said, testing can also be used to great effect when we're testing iterations of a usability improvement.
We generally waver on the border of subjectivity when we discuss this but ultimately, ask yourself "am I going to affect user behaviour or the perception of the user with this potential experiment?"
Andre Morys spoke about four different questions to ask yourself for each experiment to determine its ‘worth’. Those were: is the variation bold enough that people will notice it? Will the test affect the behavior of users? Is the page part of the sales funnel? Am I using motivational triggers?
Summary: If you're not testing a change in perception or behaviour, you're not testing
Do you have a QA process?
You need a QA process. There is nothing worse than finding out that a test failed because it was broken and hadn't gone through QA properly. A good QA (or quality assurance) process will catch and correct these bugs, even potential bugs, before they rear their ugly heads. As a result, it's an imperative part of the experimentation process.
"Over 40% of AB tests I've worked on were broken (some seriously)"
Is the page flickering? Are the goals set up correctly? Is the experiment showing in the same session? Is the experiment displaying correctly and responsively across multiple devices? Is the experiment targeted to the right group of users? Is the CSS broken? How does it look across multiple devices? What about multiple browsers? Or multiple screen resolutions? There are so many ways an experiment can be messed up and therefore passing the experiment through a robust quality assurance checklist / process is essential.
Manuel Da Costa gives us some examples of how to QA an experiment here which is very useful (thanks Manuel!). Passing your experiment through a myriad of testing scenarios can, and will, improve the validity of your testing. There are multiple ways of QA'ing (which I won't go through now) but we recommend using tools like BrowserStack for cross-browser and cross-device testing, and tailoring your QA process to suit the needs of the business. For example, experiments on the checkout should be QA'd with a lot more rigour than, say, those on the homepage, as the checkout has more effect on conversion.
Summary: There's nothing worse than a broken test so make sure it works 100% before going live by testing it
Are you testing for long enough?
We need to understand the level of significance within an experiment. Too often have we heard "the experiment is winning by xx%, let's put it live". Statistical significance is a cruel, but necessary, mistress which identifies whether variation A is actually better than the control, statistically speaking. We are, at the end of the day, running a science experiment - Peep Laja has a great article on statistical significance that I thoroughly recommend.
There is a flip side to this, which treats significance as a measure of 'risk': a significance of 80% is often read as being 80% certain that variation A will outperform the control, so accepting the remaining 20% of 'risk' is at the business's peril. (Strictly speaking, it means that if there were no real difference, a result at least this large would still appear by chance about 20% of the time.) Too often, however, have we seen imaginary lifts where what people thought won actually didn't. As a result, when placed live, we see very little, if any, impact.
Also, don't forget that conversion rates vary greatly depending on the hour of the day, day of the week and week of the month. As a rule of thumb, conversion rates are generally higher towards the end of the month, after pay day, than in the middle, for example. In addition, we can throw not just seasonality into the mix but also external variables beyond our control. The weather, for example, is a notorious influencer, in ecommerce especially.
As a result, when we ask 'are we testing for long enough?' it's not just the level of significance that we need to satisfy (recommended at 95%); we also need to run experiments for at least a full week, and ideally for 14 days, at which point we can make a judgement based on data.
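To get a feel for 'long enough' before launching, you can estimate the traffic a test needs and round the run time up to whole weeks. The sketch below uses the standard normal-approximation formula for comparing two proportions. The 95% level and the 14-day floor come from the guidance above; the 80% power, the 3% baseline and the daily traffic are our assumptions for illustration.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline_cr, relative_uplift,
                            confidence=0.95, power=0.80):
    """Visitors needed in each variant to detect the given relative uplift.
    ASSUMPTION: power=0.80 is a common convention, not from the article."""
    p1 = baseline_cr
    p2 = baseline_cr * (1 + relative_uplift)
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

def duration_in_days(n_per_variant, variants, daily_visitors, min_days=14):
    """Days to run, never below min_days and rounded up to full weeks."""
    days_needed = math.ceil(n_per_variant * variants / daily_visitors)
    full_weeks = math.ceil(max(days_needed, min_days) / 7)
    return full_weeks * 7

# Hypothetical inputs: 3% baseline, hoping to detect a 10% relative uplift
n = sample_size_per_variant(baseline_cr=0.03, relative_uplift=0.10)
days = duration_in_days(n, variants=2, daily_visitors=5_000)
print(n, days)
```

Rounding up to whole weeks keeps every day of the week equally represented, which matters given how much conversion rates move across the weekly cycle.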
"One of the difficulties with running tests online is that we are not in control of our user cohorts. This can be an issue if the users distribute differently by time and day of week, and even by season. Because of this, we probably want to make sure that we collect our data over any relevant data cycles."
- Matt Gershoff, Conductrics
Another question to ask is how are you calculating your statistical significance? There are various tools to use and some are more basic than others. For example, Get Data Driven is a tool that reports only the significance between A and B, whereas our personal favourite, http://abtestguide.com/calc/, also visualises the p-value, which is extremely useful.
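If you want to see what such a calculator does under the hood, here is a minimal sketch of a two-sided, two-proportion z-test. This is one common approach and we are assuming it for illustration; the tools mentioned above may use different methods. The visitor and conversion counts are invented.

```python
import math
from statistics import NormalDist

def ab_significance(control_visitors, control_conversions,
                    variant_visitors, variant_conversions):
    """Two-sided z-test on the difference between two conversion rates."""
    p_c = control_conversions / control_visitors
    p_v = variant_conversions / variant_visitors
    pooled = (control_conversions + variant_conversions) / (
        control_visitors + variant_visitors)
    se = math.sqrt(pooled * (1 - pooled)
                   * (1 / control_visitors + 1 / variant_visitors))
    z = (p_v - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {
        "uplift": (p_v - p_c) / p_c,          # relative uplift
        "p_value": p_value,
        "significant_at_95": p_value < 0.05,
    }

# Same 20% relative uplift, very different amounts of traffic (hypothetical)
big = ab_significance(10_000, 300, 10_000, 360)
small = ab_significance(1_000, 30, 1_000, 36)
print(big)
print(small)
```

Both examples are "winning by 20%", but only the larger one clears the 95% threshold, which is exactly why an uplift figure on its own is not a reason to put a variant live.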
Summary: The higher the significance the lower the risk - it's likely that some tests won't reach significance (especially if you don't alter user behaviour or perception). In this case, it's down to you to make a smart decision on which variant to implement (hint: let the data guide you!)
Are you reporting on 'more' than conversion rates?
Let's say you place an experiment live on the homepage; what are you measuring? We would assume conversion rate (as the staple metric for all A/B experiments) and perhaps progression to the basket and checkout.
Hold on. The homepage is technically a non-converting (or lesser-converting) page compared with pages further down the funnel. Your experiment could in theory affect a huge range of user behaviours. What about the number of users that reach a group of pages (e.g. product)? Those that trigger a series of events? More vanity-based metrics such as bounce rate or time on page? The results for mobile or tablet users? New users vs returning users? The results for users in a specific hour of the day? The level of analysis is, in theory, endless. Your optimisation tool will give you a limited picture, and a picture of averages at that. To truly understand your experiment you must dive deep into more specific data points; bear in mind that as you do so each segment's sample gets smaller, so your significance is reduced.
In addition to that, most A/B testing tools will only record a goal once you've told them to track it. It's quite common to ask questions after the experiment is running about data that you're 'technically' not tracking. When you pull the experiment data into Google Analytics you are tracking everything already, so it's more a case of matching data points than missing them.
When analysing tests, the majority of conversion points are not URL defined, but events. As a result, using effective event tracking is highly recommended within your data configuration. If you use our closest ID tracking in GTM you'll be able to mop up all events on your site, and with the data pulled in from your experiments, you'll be able to track everything.
Let's also not forget that users are 'bucketed' into an experiment by the A/B testing tool. This does not mean that they have seen the experiment, or even that they have used it. As an example, we tested an iteration of a homepage product wizard. For this experiment, we had to understand the number of users that 'saw' the experiment (i.e. homepage users) and then, using events, the number that used the homepage wizard. In this instance, the homepage wizard converted at a higher rate than the incumbent, but fewer users 'used' it and instead opted for the other elements on the page such as the navigation and search bar. We wouldn't have known this had it not been for the integration into Google Analytics.
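The distinction between bucketed, saw and used is easy to lose in a single conversion figure. Here is a minimal sketch of the breakdown, using invented user counts chosen to mirror the pattern described above (a wizard that converts better but is used less). The function and field names are ours.

```python
def exposure_funnel(bucketed, saw, used, converted):
    """All arguments are user counts for one variant.
    'converted' counts purchasers among those who used the feature."""
    return {
        "saw_rate": saw / bucketed,    # bucketed users who reached the page
        "usage_rate": used / saw,      # page viewers who used the feature
        "user_cr": converted / used,   # conversion rate of feature users
    }

# Hypothetical counts for the incumbent wizard and the new iteration
incumbent = exposure_funnel(bucketed=20_000, saw=12_000, used=2_400, converted=120)
iteration = exposure_funnel(bucketed=20_000, saw=12_000, used=1_800, converted=108)

print(incumbent)
print(iteration)
```

In these made-up numbers the iteration converts its users at a higher rate yet produces fewer orders overall because fewer people use it, a result you would miss if you only compared bucketed users.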
Overlooking this is quite a common mistake when reporting on A/B tests.
Summary: Integrate data into Google Analytics to cross-pollinate data sources across different events and types of conversion points. It's important to revisit the behaviour or perception you wanted to influence in the test and ensure that the result did indeed satisfy the original hypothesis.
Are you reporting on 'more' than quantitative data?
Think about what you're affecting in an A/B test. You're trying to mould user perception. Do you therefore think that quantitative data is enough to work from? Can a mere 4 or 5 goals in your A/B testing tool tell you if you have affected the perception of the user? Probably not.
As a result, it's important to try to source qualitative feedback from your A/B tests. This isn't just so you report on a test output more accurately, but also to understand the test results better. About 70% of our experiments are iterations of an initial hypothesis. Once we understand why something won, we can generate better and better results because we're playing on an iteration of a theme that positively affects a user's motivation to convert. That's why we're all here, right? To improve? Qualitative feedback, known as 'voice of the customer', gives us that level of understanding.
With this in mind, reporting on A/B tests from a qualitative standpoint is vital to truly understanding why an experiment won or lost. To do this we recommend integrating your experiment with other feedback tools to understand three facets of the user's thought process:
- Heatmaps will show you where the user clicked, hovered and scrolled on your variation
- Screen recordings will show you exactly the path that the user took
- Remote user testing will demonstrate the perceptual feedback of the user as they complete activities that you set
We wrote an article on how to integrate your experiments into HotJar to source the heatmaps and screen recording data of your experiment. We generally also place remote user testing on our experiments, too, because we have our own remote user testing tool.
Summary: Integrate data into other optimisation tools to get more than a data perspective view and truly understand how the voice of the customer has been affected
Do you embrace failure?
And on the above point of iterations, do you embrace failure? We learn way more from our failures than from our successes, right? Too often optimisers give up after the first experiment. What did Thomas Edison once say? "I have not failed. I've just found 10,000 ways that won't work."
Subjectivity, passion and bias can get in the way of success, with your subconscious suggesting that the test "should have won". Your positive mental attitude will then either a) stop you delving into why a test failed or b) have you prove to yourself and others that it won despite the data suggesting otherwise. After all, I'm certain you could find some data to suggest any test won.
As suggested above, about 70% of our experiments are iterations of an initial hypothesis. If a test fails it's not necessarily because of our hypothesis but because of the execution of the hypothesis. Yes, the hypothesis might need some tweaks and improvements but once we understand what users are doing in practice then we can truly understand motivations and behaviour. Let me put it another way. We wrote an article on how to generate hypotheses for experiments and to be wary of assumption. In this we stated that all the research and discovery in the world is still only theoretical. We're still making assumptions, albeit well educated assumptions. When you experiment, you are testing your hypothesis in practice. There is no greater proof than a practical experiment, not theory, where data suggests that X did Y, instead of assuming that X would do Y.
There's always a way to prove a point using data, and if you're using data to prove a point, then you're doing it wrong. With every test, have more than one person analyse the results. If one person says a test worked and the other says it didn't, it probably didn't (and the person who said it did probably wrote the hypothesis!)
Take a look at Content Verve's "how a negative test result produced a 48.6% lift in conversions on a B2C landing page" for example. The same can also be said for experiments that win - don't stop there - learn from them! Take a look at Wider Funnel's example for the DMV here with their surprising results after 3 years of the original experiment.
Summary: Embrace failure as a means to learn; don't let bias, pride or passion get in the way.
Do you do AA testing?
AA testing validates the testing source by ensuring that you get the same result from each variant. It's also commonly known as the comma test where you place a comma in a sentence to see if it makes a difference - in theory the answer is no (how would a comma affect conversions?). In this instance, we send 50% of the traffic to A and 50% of the traffic to B, where both A and B are the same (hence "AA").
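One thing worth knowing before reading an AA result: even with identical variants and a healthy setup, a 95% significance threshold will flag a 'winner' roughly one time in twenty purely by chance. The simulation below is a sketch of that effect under assumed conditions (a 5% true conversion rate, 2,000 visitors per arm and a two-proportion z-test), none of which come from a real test.

```python
import math
import random
from statistics import NormalDist

def p_value(n_a, conv_a, n_b, conv_b):
    """Two-sided p-value for the difference between two conversion rates."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or all converted): no detectable difference
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate_aa(runs, visitors_per_arm, true_cr, seed=0):
    """Share of AA tests that look 'significant' at 95% despite no difference."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(runs):
        conv_a = sum(rng.random() < true_cr for _ in range(visitors_per_arm))
        conv_b = sum(rng.random() < true_cr for _ in range(visitors_per_arm))
        if p_value(visitors_per_arm, conv_a, visitors_per_arm, conv_b) < 0.05:
            false_positives += 1
    return false_positives / runs

# Hypothetical set-up: both arms share the same 5% conversion rate
rate = simulate_aa(runs=300, visitors_per_arm=2_000, true_cr=0.05)
print(f"AA tests flagged as significant: {rate:.1%}")
```

So a single AA test that shows a difference is not proof that your tool is broken; a pattern of them is what should worry you.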
This approach has come under some scrutiny, with Craig Sullivan calling it a "waste of time" (at least as done by most companies out there) because of what it costs in testing efficiency. It's a brilliant article and I thoroughly recommend you read it. For example, would you rather use up a 'testing slot' by testing the same thing, or by testing something that could actually make a difference to your bottom line? He's got a point. Instead he recommends that you just QA the test thoroughly and then there should be no statistical impact in the data.
This is a reason why, when we do AA tests, we always incorporate a "B" variant in them that is different and potentially conversion affecting. We are therefore testing, not just whether the data sources are validated and statistically measurable, but also whether our B variant is a winner - like any other experiment. @danbarker agrees.
In addition, another less common version of AA testing is to use two analytics sources. As covered above in 'Are you reporting on 'more' than conversion rates?', we also recommend integrating with Google Analytics. In doing so, you're getting the benefit of validating your conversion points against both Google Analytics and your A/B testing source. Now there will be some discrepancy as, for example, GA measures on sessions where Optimizely measures on users, but ultimately the results should be the same or similar. That is, if B wins by x% in Optimizely, in theory it should win by a similar amount in Google Analytics.
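A simple way to run that cross-check is to compare the relative uplift each source reports rather than the raw conversion rates, since session-based and user-based rates will never match exactly. The sketch below does that; the rates are invented and the tolerance of five percentage points is an arbitrary assumption you should tune to your own data.

```python
def relative_uplift(control_rate, variant_rate):
    """Relative change of the variant over the control."""
    return (variant_rate - control_rate) / control_rate

def sources_agree(tool, analytics, tolerance=0.05):
    """tool / analytics: dicts with 'control' and 'variant' conversion rates.
    ASSUMPTION: tolerance is a hypothetical threshold on the uplift gap."""
    tool_uplift = relative_uplift(tool["control"], tool["variant"])
    analytics_uplift = relative_uplift(analytics["control"], analytics["variant"])
    same_direction = (tool_uplift > 0) == (analytics_uplift > 0)
    return same_direction and abs(tool_uplift - analytics_uplift) <= tolerance

# Hypothetical figures: user-based rates in the testing tool,
# session-based rates in the analytics package
testing_tool = {"control": 0.040, "variant": 0.046}
analytics_ok = {"control": 0.030, "variant": 0.0339}
analytics_bad = {"control": 0.030, "variant": 0.0295}

print(sources_agree(testing_tool, analytics_ok))
print(sources_agree(testing_tool, analytics_bad))
```

If the two sources disagree on direction, investigate the tracking before trusting either number.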
You also don't need to run A/A tests all the time. We generally run one as a cleanse or sense check every quarter, or sometimes on complex tests; the perils identified by Craig Sullivan still remain.
Summary: Use AAB testing and alternative data sources to validate data measurement
How are you measuring your success?
Measuring success is not just about measuring the success of the website's various metrics, but measuring the success of your metrics too.
What are you measuring? Is it just conversion rates? (We hope not, based on 'Are you reporting on 'more' than conversion rates?' above.) Andre Morys at WebArts.de claims that tracking the wrong ecommerce KPIs is one of his biggest mistakes; it’s not necessarily about conversions but about the bottom line. He sought to validate this theory in an experiment where he found that variation 1 (a focus on discount) saw a 13% uplift in conversions but a 14% drop in bottom line, whereas variation 2 (a focus on value) saw a 41% uplift in conversions and a 22% increase in bottom line.
As a result, variation 1 lost even though the testing tool reported an uplift. The uplift in conversions for variation 2 was much bigger than the “real uplift” in bottom line. This is why Andre recommends that everybody does a cohort analysis after testing high-contrast changes in ecommerce.
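To see how a conversion uplift can hide a loss, it helps to work it through as profit per visitor. The model below is deliberately simple and entirely our own assumption (a flat margin rate and a fixed discount per order), with invented figures; it does not reproduce Andre's data, only the mechanism.

```python
def profit_per_visitor(conversion_rate, average_order_value,
                       margin_rate, discount_per_order=0.0):
    """Expected profit from one visitor under a simple flat-margin model."""
    profit_per_order = average_order_value * margin_rate - discount_per_order
    return conversion_rate * profit_per_order

def compare(control, variation):
    """Return (relative change in conversion rate, relative change in profit)."""
    cr_change = variation["conversion_rate"] / control["conversion_rate"] - 1
    profit_change = profit_per_visitor(**variation) / profit_per_visitor(**control) - 1
    return cr_change, profit_change

# Hypothetical figures for a control and a discount-led variation
control = {"conversion_rate": 0.0200, "average_order_value": 100.0,
           "margin_rate": 0.30, "discount_per_order": 0.0}
discount = {"conversion_rate": 0.0226, "average_order_value": 100.0,
            "margin_rate": 0.30, "discount_per_order": 7.0}

cr_change, profit_change = compare(control, discount)
print(f"conversion rate: {cr_change:+.1%}, profit per visitor: {profit_change:+.1%}")
```

The testing tool would only ever report the first number, which is why the bottom line needs measuring separately.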
"Conversion Rates Are Only a Leading Indicator of Success"
In addition, we recommend measuring the performance of your own efforts as an optimiser. How many experiments do you run per month (or how many tests have you run in the last 12 months)? What percentage of tests deliver an uplift? What is the average % uplift per test? What percentage of successful tests deliver over a 5% CR increase? As a rule of thumb this figure of 5% is the industry benchmark. I’m personally unsure where it came from but people in the industry do use it as a metric to benchmark against.
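These programme-level numbers are easy to compute once you keep a log of results. Here is a minimal sketch, assuming each completed test is recorded as its relative change in conversion rate; the log below is invented, and the function and field names are ours.

```python
def programme_metrics(uplifts, months):
    """uplifts: relative CR change per completed test (0.08 means +8%).
    Flat or losing tests are recorded as zero or negative values."""
    total = len(uplifts)
    winners = [u for u in uplifts if u > 0]
    return {
        "tests_per_month": total / months,
        "win_rate": len(winners) / total,
        "average_uplift": sum(uplifts) / total,
        # Share of winning tests that beat the 5% benchmark
        "winners_over_5pct": (
            sum(1 for u in winners if u > 0.05) / len(winners) if winners else 0.0
        ),
    }

# Hypothetical results from three months of testing
results = [0.08, -0.02, 0.03, 0.0, 0.12, -0.05]
metrics = programme_metrics(results, months=3)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Tracking these over time tells you whether your research and prioritisation are improving, not just whether individual tests won.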
Summary: Test your program as well as your results
In summary, there is a lot to consider when A/B testing. With the proliferation of A/B testing tools, the ease with which their WYSIWYG editors let you build tests, the success stories preached by the likes of WhichTestWon, and the fact that companies are more understanding of the value of CRO, there are in turn a lot of mistakes being made.
The above are what we consider 10 questions to ask yourself to sense-check that you are, indeed, A/B testing properly and not just another cowboy... there are actually many more questions to ask as the subject is so broad but here's our starter for ten.
If you're looking at hiring a conversion optimisation agency, odds are they will consider the above or some variation of the above (you'd like to think so!). We have a handy little guide here on how to hire the right conversion optimisation agency.
- Why AA testing is a waste of time, Craig Sullivan
- How to Run More Effective A/B Tests: An Optichat Recap, Optimizely
- How to come up with more winning tests using data, Peep Laja
- The Ultimate Guide to A/B testing, Paras Chopra
- A/B Testing Mastery: From Beginner To Pro in a Blog Post, Alex Birkett
- How to Analyse your A/B results with Google Analytics, Peep Laja
- 5 insights from every speaker at Conversion XL, Peep Laja
- How to QA your A/B test, Manuel Da Costa
- Is Conversion Rate Optimization a Dead End?, Bryan Eisenberg