How to prioritise A/B tests

by David Mannheim

posted on: December 16th 2015

For every client we might generate, initially, anywhere between 100 and 200 experiment ideas. That's a lot of experiments to prioritise, and without some level of prioritisation we could be testing for months if not years. Our values are all about return on investment for our partners; the quicker the return on investment, the easier we can all breathe (although that shouldn't stop us from working just as hard).

As a result, how do we prioritise these experiments?

The answer isn't terribly scientific. If it were, and we knew exactly what to prioritise because we knew what we should test next, we wouldn't have to test at all, right?

We adopt a three-stage prioritisation model that we'd like to share with you.

1. PIE Model

The PIE model was created by the team at Wider Funnel and we think it's fantastic. It stands for Potential, Importance and Ease. However, we change this slightly to Potential, Impact and Ease; predominantly, and until we're convinced otherwise, we believe that having a subjective view on how you think an experiment will perform is important. This is a very subjective way of prioritising tests; however, it is born out of experience and a great understanding of your user. Whilst we're told to remove 'all subjectivity' from testing, I strongly believe that in doing so we also remove instinct; and instinct is what pushes us forward and drives creativity.

  • Potential is the potential the page has: where are your worst performers? For this we need to understand our analytics, and we recommend sorting your page content by weight (see how here) against bounce rate, exit rate and page value. Even then, generally speaking, in an ecommerce world the average conversion uplift per test comes largely from the product page (9.28%) followed by the basket (6.59%). Does this mean we should be testing the product and basket pages first? It all depends on your data, not the data of others.
  • Impact is what you believe the potential impact of the experiment will be. There is a slight crossover between potential and impact, but the latter allows you to demonstrate your subjective understanding of how you believe the experiment will perform. For example, from a strong understanding of our user behaviour, we might believe that adding details around free delivery within the checkout to match expectancy is more important than, say, adding some trust logos in the footer, based on our understanding of users' motivation towards trust and expectancy.
  • Ease is quite simply the ease with which experiments can be implemented. Generally this means the technical implementation, but it is also worthwhile considering the political barriers involved and any administrative work required pre-test (getting content, for example, or some data from the back-end system).

From here, each attribute is given a score out of 10 and the three scores are averaged (their sum divided by 3). As a result, we might see the potential of an experiment as a 10, the impact as an 8 but the ease as a 4 (a difficult implementation), and the total score for that experiment would be 7.3.
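To make the scoring concrete, here's a minimal sketch of how a backlog might be ranked this way in Python; the experiment names and scores are illustrative, not taken from a real backlog.

```python
# Minimal PIE scoring sketch: each attribute is scored out of 10,
# the three scores are averaged, and experiments are ranked by the result.

experiments = [
    # (name, potential, impact, ease) -- illustrative scores only
    ("Add free-delivery messaging to checkout", 10, 8, 4),
    ("Add urgency to the call to action",        8, 7, 7),
    ("Add trust logos to the footer",            5, 4, 9),
]

def pie_score(potential: float, impact: float, ease: float) -> float:
    """Average of the three attributes, each scored out of 10."""
    return round((potential + impact + ease) / 3, 1)

ranked = sorted(
    ((name, pie_score(p, i, e)) for name, p, i, e in experiments),
    key=lambda row: row[1],
    reverse=True,
)

for name, score in ranked:
    print(f"{score:>4}  {name}")  # the 10 / 8 / 4 example above averages to 7.3
```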

2. Attribute Value

Somewhere, someone has stated that obtaining a 5% uplift in conversion rate for an experiment is a "good, successful A/B test". I'm not sure who that person is or where it comes from (if you know, please let me know...), but regardless, it's some form of arbitrary benchmark to work towards. We use that benchmark by assigning value to our experiments.

What would happen if this experiment achieved a 5% uplift in conversions? This is particularly useful in scenarios such as:

  • Experiments with the same PIE score, e.g. 7.3 and 7.3
  • Giving clients some level of expectation and/or forecasting, particularly to report to their seniors (note: further caveats are undoubtedly required in these situations)
  • Forecasting

Let's take the example of adding urgency to a call to action, by whatever means. Within analytics we're able to determine that 5,000 users click on that call to action per month and that, having clicked on it, they convert at a rate of 3.45%, giving a total monthly revenue of, for ease, £6,000. We can find all this data within Google Analytics by:

  1. Seeing how many users click on that event (5,000)
  2. Creating a custom segment from that event using the event label (or similar) and then viewing its ecommerce conversion rate (3.45%)
  3. Applying a 5% uplift to that conversion rate (3.62%), assuming the same number of users will click the call to action, as these users now have a higher propensity to purchase. Granted, this ignores any potential uplift in the 5,000 clicks themselves, which "should" also increase - or so your hypothesis states

The 5% benchmarked value of this experiment is therefore worth an additional £295.18 per month which, over 12 months, is £3,542.16.
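As a rough sketch of that arithmetic using the figures above: exact arithmetic lands at £300 a month (5% of £6,000), and small differences against the £295.18 figure come down to where intermediate values are rounded.

```python
# Sketch of the 5% benchmark valuation using the example figures above.
# Assumes the same number of users click the call to action after the change.

monthly_clicks   = 5_000
conversion_rate  = 0.0345      # 3.45% of those clicks convert
monthly_revenue  = 6_000.00    # pounds per month from those conversions
benchmark_uplift = 0.05        # the arbitrary "good test" 5% relative uplift

revenue_per_conversion = monthly_revenue / (monthly_clicks * conversion_rate)

uplifted_rate       = conversion_rate * (1 + benchmark_uplift)   # ~3.62%
extra_conversions   = monthly_clicks * (uplifted_rate - conversion_rate)
extra_monthly_value = extra_conversions * revenue_per_conversion

print(f"Uplifted conversion rate: {uplifted_rate:.2%}")
print(f"Extra monthly revenue:    £{extra_monthly_value:,.2f}")
print(f"Extra annual revenue:     £{extra_monthly_value * 12:,.2f}")
```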

3. Behavioural themes

What are we doing with experiments? We're affecting user perception or user behaviour. If we're not affecting either of those two, we're not testing. Both are encompassed under "user motivation": what is stopping our users from doing X? What is motivating our users to do Y?

We develop what we call "themes" for our users. These are behavioural themes that affect the perception or behaviour of that user in a positive or negative manner. We generally assign 6 to 10 themes to a website and label each experiment with the theme(s) it will affect.

For example, Booking.com would have themes such as urgency and scarcity. AO.com might have behavioural themes of value and social proof and so forth. UserConversion.com might have themes of authority and trust, based on the fact that we're a consultancy and relatively young in the digital landscape.

What we'll see is that, as we experiment with hypotheses, some themes will outperform others. We might find in our historical backlog of tests that urgency is a bigger motivator for our users than, say, scarcity. As a result, we know going forward that this behavioural theme has a higher propensity to affect user motivation, and we will prioritise accordingly.
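As a sketch of what that backlog review might look like, the snippet below tallies win rate and average observed uplift per theme; the themes, results and uplift figures are made up purely for illustration.

```python
# Illustrative sketch: tally performance per behavioural theme from a
# historical backlog of experiments, then rank themes by win rate.
from collections import defaultdict

# (themes, won, observed_uplift) -- made-up backlog entries for illustration
backlog = [
    (["urgency"],             True,   0.062),
    (["urgency", "scarcity"], True,   0.048),
    (["scarcity"],            False, -0.010),
    (["social proof"],        True,   0.031),
    (["trust"],               False,  0.004),
]

stats = defaultdict(lambda: {"tests": 0, "wins": 0, "uplift": 0.0})
for themes, won, uplift in backlog:
    for theme in themes:
        stats[theme]["tests"] += 1
        stats[theme]["wins"] += int(won)
        stats[theme]["uplift"] += uplift

ranked = sorted(stats.items(),
                key=lambda kv: kv[1]["wins"] / kv[1]["tests"],
                reverse=True)

for theme, s in ranked:
    win_rate = s["wins"] / s["tests"]
    avg_uplift = s["uplift"] / s["tests"]
    print(f"{theme:<14} win rate {win_rate:.0%}, avg uplift {avg_uplift:+.1%}")
```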

Woah! There's more! (Updated 04/10/16)

4. PXL Model

A great (new!) model for prioritising experiments based on binary (yes/no) scoring criteria. Conversion XL discusses a similar flaw with the PIE model (and the ICE model, but that's less common), which is basically that it's subjective. To remove that subjectivity, take a look at the PXL model and the article on their website here.

Check out the spreadsheet here



David Mannheim

David is an experienced conversion optimiser and has worked across a series of core optimisation disciplines including web analytics, user experience and A/B and MVT testing.
