The Good, The Bad and The Ugly of A/B Testing

On the one hand I think it’s a very powerful tool that can help designers cut through opinion battles, test hypotheses, and get the most effective solutions into the hands of customers quickly. However, we also run the risk of removing human judgement from the equation, prioritising company KPIs over customer needs, and iterating towards local maxima.

[Chart: the dangers involved in iterating towards a local maximum, also known as “putting lipstick on a pig.”]
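As a rough illustration of the local maximum problem, here is a minimal Python sketch. The `conversion` function and its numbers are entirely made up for this example: it describes a design space with a small hill and a bigger one. The `hill_climb` function (also hypothetical) stands in for a team that only ever ships the neighbouring variant that tests better.

```python
def conversion(x: int) -> float:
    """Hypothetical design landscape: a small hill peaking at x=2
    and a bigger hill peaking at x=8. The numbers are invented."""
    small_hill = 5 - (x - 2) ** 2
    big_hill = 10 - (x - 8) ** 2
    return max(small_hill, big_hill, 0)


def hill_climb(start: int, lo: int = 0, hi: int = 10) -> int:
    """Move to whichever neighbouring design scores better, and stop
    as soon as no neighbour is an improvement."""
    x = start
    while True:
        neighbours = [n for n in (x - 1, x + 1) if lo <= n <= hi]
        best = max(neighbours, key=conversion)
        if conversion(best) <= conversion(x):
            return x  # no small change tests better, so we stop here
        x = best


if __name__ == "__main__":
    print(hill_climb(start=1))  # starts on the small hill
    print(hill_climb(start=6))  # starts on the big hill
```

Starting at 1, every incremental test leads to the top of the small hill and stays there, even though a much better design exists at 8. Only a bigger leap, which is a judgement call rather than an incremental test, gets you onto the other hill.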

The Benefits of A/B Testing

The benefits of A/B testing are fairly obvious. You have a couple of simple design or product improvements you’re looking to make, but aren’t 100% sure which will yield the best results. One of your team feels very strongly about one direction, one of your stakeholders feels strongly in the other direction, and the rest of you can’t be sure either way.

In these sorts of stalemates, politics invariably comes into play. Who has the most sway in the organisation, who is the best at arguing their case, and who can’t you afford to piss off? If the person in question is particularly good at arguing their case, could they argue the counterpoint just as convincingly? In which case, is there really a clear and obvious solution?

In these situations two things can happen. Either one person gets their way, leaving the others feeling frustrated that their opinions haven’t been heard, or worse, no decision gets made and this conversation is set to repeat itself ad nauseam. I’m sure we’ve all been there.

A good example of this sort of decision is the humble ratings widget. Imagine you’re in a product team trying to decide whether a star or a number system would be better. If you go for the star system, should you use four, five or six stars, and should you allow partial stars to show? If you choose a number system, should it be a 5, 10 or 100 point rating, and will you accept decimals?

You could do some desk research to see if there’s a research paper on the subject, but how do you know this will work with your particular audience? Or you could simply copy what your competitors are doing. After all, they’re bigger than you so surely must have done the research?

This is the situation that Sim Wishlade from OpenTable found himself in recently, and he writes eloquently about the research they undertook and the findings they made in a Medium article entitled “When 3/5 doesn’t equal 6/10”. As you can imagine, the obvious solution isn’t always the best solution.

When Opinion Fails

As designers we see ourselves as experienced decision makers with a keen sense of what works and what doesn’t. Repeated A/B testing goes to show that we’re not quite as accurate as we’d like to think. We like to explain that customers are terrible judges of future behaviour, without realising that we fall prey to the same biases. It turns out we’re actually pretty poor judges of future user behaviour.

In his talk from The Next Web back in 2013, Dan Siroker explains some of the tests he carried out for the Obama Campaign. He poses a simple question to the audience: which of these videos, which of these header images and which of these calls to action proved to be the best? As you can imagine, the audience got the answer spectacularly wrong, as did their own designers.
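To make “proved to be the best” a little more concrete, here is a minimal sketch of one common way to compare two variants: a two-proportion z-test. This is an assumption about how such a comparison might be done, not a description of the method used in the tests mentioned above. The function name and the visitor numbers are made up for illustration.

```python
from math import erf, sqrt


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in conversion
    rate between variant A and variant B."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the assumption that A and B perform the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Standard normal CDF via the error function, then the two-sided tail.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - cdf)


if __name__ == "__main__":
    # Invented numbers: 2,000 visitors per variant.
    z, p = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=260, n_b=2000)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference is unlikely to be noise. It says nothing about whether the winning variant is the right thing to ship, which is a separate question.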

Why Don’t Designers Test More?

If even the best designers make these sorts of mistakes, why aren’t we testing more? I think there are a couple of answers to this.

The first answer is a process and tooling problem. While it’s relatively easy for an individual designer to initiate a quick usability test independently, A/B testing requires tooling and coordination. You need to pick an A/B testing platform, you need your dev partners to implement it, and you need to have both the time and capacity to do something with the results. As you can see, there are a lot of places where something like this can fall down.

For a start, picking and implementing A/B testing solutions takes time and money. So who is going to be responsible for evaluating and procuring this system, who is going to manage and implement the system, and who is going to pay for it? Most product teams find it hard enough shipping their existing backlog, without adding more work for themselves.

If you do manage to put a testing framework into place, implementing tests will require coordination between your design, engineering and analytics teams. I’m sure QA will also have something to say here. Considering the relationship between design and engineering can be challenging at the best of times, is it any wonder that designers prefer to stick with a more qualitative testing approach they can manage largely on their own?

The other problem is one around power and self-image. Designers have always been at the bottom of the pecking order when it comes to product teams, as they have neither the muscle of the engineering teams (in the form of headcount) nor the influence of the product managers. So designers are very conscious of anything that may devolve further power away from them.

Admitting that they don’t know the answers to even basic questions like “should we use star or number ratings?” can undermine what little power and status they already have. Handing some of that power over to a more analytically focussed team can be even more challenging, especially if recent history is anything to go by.

41 Shades of Blue

No, we’re not talking about some new E.L. James series here. Instead we’re talking about an incident that happened at Google many years ago, and has subsequently gone down in designer folklore. In the early days of Google, design didn’t yet have a “seat at the table” and designers found it difficult to thrive in an environment dominated by testing. Every small decision needed to be tested and validated, including the precise hex value of a button. In a now-famous article, designer Doug Bowman explained how their testing culture had driven him out.

“Yes, it’s true that a team at Google couldn’t decide between two blues, so they’re testing 41 shades between each blue to see which one performs better. I had a recent debate over whether a border should be 3, 4 or 5 pixels wide, and was asked to prove my case. I can’t operate in an environment like that. I’ve grown tired of debating such minuscule design decisions. There are more exciting design problems in this world to tackle.”

This article resonated with many designers who felt that they had to prove every small decision from first principles, and that their agency as designers was being diminished as a result. While Google later justified their decision, I can understand this point of view. It often feels like questionable decisions get pushed to production by other departments with little or no due diligence, but when a designer wants to make even a small change, they have to provide incontrovertible evidence that it’s the right thing to do.

Building a Culture of Experimentation

The key thing, therefore, is balance: creating a culture where A/B testing isn’t used as a battering ram to win arguments, and ensuring the right things get tested.

One company that understands testing better than most is Booking.com. In his excellent Google Conversations talk, then Director of Design Stuart Frisby explains how he went about creating a culture of experimentation amongst his team.

The Potential Pitfalls of A/B Testing

While designers should be doing a lot more A/B testing than they currently are, there are some pitfalls to be aware of. Not least the idea that every solution that performs better is therefore the right solution.

It’s easy to see where this attitude comes from. Teams are taught that it’s their job to improve customer experience, and that the way to do this is to optimise for specific KPIs and OKRs, usually relating to acquisition, retention and customer satisfaction. So teams set out to move the dial, safe in the knowledge that if they hit their targets, everything will be okay.

You can see a glimpse of this mindset in this excellent story from Anna Blaylock at Netflix. On joining the Netflix team, Anna wondered why the product didn’t let prospective customers see all the content on offer before signing up. From a designer’s perspective, this seems like a sensible thing to do, and so Anna set about devising a test.

As you can probably imagine, the test came back negative. It turns out that more people sign up to Netflix if they don’t get to see everything that’s on offer in advance. Anna likens this to the difference between reading a restaurant menu and eating the food. The experience of using Netflix is so much more than just the list of films and TV shows on offer; anything else is a distraction. If I worked at Netflix, I’d probably have a similar take. I'd know the product was amazing, in part because I built it, and so all I needed to do was remove as many barriers as possible, to let others fall in love with it as well.

However, I believe there could be another reason why registrations went down. What if some users wanted to make an informed decision about the content and decided that there wasn’t enough value for them? Maybe they were looking to see if their favourite show was on that particular platform, or they were trying to judge whether Netflix had more of the content they liked than Amazon? So maybe showing users the catalogue before signing up was the “right” thing to do, even at the expense of sales.

I think this is one of the challenges with A/B testing. It’s easy to assume that optimising around a specific metric is inherently good, and do everything in your power to hit those numbers.

Now this particular incident is fairly innocuous, which is why I picked it. However I do believe that when unchecked, A/B testing can lead to serious problems.

The World We Live in Today

It’s safe to say that things are a little crazy at the moment. Global pandemics aside, all sorts of fringe beliefs seem to be gaining traction of late, from flat Earth theorists, to people burning down 5G masts, and much much worse.

While a lot of people blame “The Algorithms”, what they’re really talking about is an extreme form of multivariate testing: seeing what content is most effective at driving engagement, and providing more of it. Modern platforms may be running dozens, hundreds, even thousands of tests a day. Many of these tests are automated, the decisions aren’t fully understood, and aren’t necessarily passed through some sort of ethical lens. As long as the metrics are going up and to the right, everybody is happy.

[I sometimes joke that while engagement metrics are going up and to the right, so is the tone of the conversation]
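The “see what drives engagement, then provide more of it” loop can be sketched in a few lines. What follows is an epsilon-greedy bandit, a simple textbook technique chosen here purely for illustration. It is an assumption, not a claim about how any real platform works. The function name, the engagement rates and the parameters are all hypothetical.

```python
import random


def epsilon_greedy(true_rates, rounds=10_000, epsilon=0.1, seed=0):
    """Repeatedly pick a piece of content to show. Most of the time show
    whatever has the best observed click rate so far; occasionally
    (with probability epsilon) show something at random to keep learning.
    Returns how many times each item was shown."""
    rng = random.Random(seed)
    shows = [0] * len(true_rates)
    clicks = [0] * len(true_rates)

    def observed_rate(i):
        # Items that have never been shown count as a rate of zero.
        return clicks[i] / shows[i] if shows[i] else 0.0

    for _ in range(rounds):
        if rng.random() < epsilon:
            item = rng.randrange(len(true_rates))  # explore
        else:
            item = max(range(len(true_rates)), key=observed_rate)  # exploit
        shows[item] += 1
        # Simulate whether the (hypothetical) user engaged.
        if rng.random() < true_rates[item]:
            clicks[item] += 1
    return shows


if __name__ == "__main__":
    # Invented engagement rates for three pieces of content.
    print(epsilon_greedy([0.05, 0.10, 0.30]))
```

Notice what is missing: nothing in the loop asks whether the most engaging item is accurate, healthy or good for the person seeing it. The only signal is the click.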

I worry that in their quest for efficiency, many of the big players have over-instrumented their design processes, removing human due diligence and hiding behind their OKRs in their unwavering belief that more means better.

I mean, I think we can all agree on the transformational experience of travel (carbon footprint aside). But what about the additional stress you’re causing with your cleverly worded prompt explaining it’s the “last room left at this price” and that “5 other people are looking at this room right now”? It’s not exactly a dark pattern (assuming the information is true) and I’m sure the wording tested well, but is it the “right” thing to do?

Similarly I think we can all agree how useful the recommendations are on video streaming platforms. However is it really beneficial to the user when they intended to watch one video, but end up watching three? The biggest competitor to these streaming platforms may be sleep, but is that something to optimise towards?

Now I appreciate that A/B testing, and its cousin programmatic recommendations, aren’t solely to blame here. But the scale at which this is happening inside companies has become difficult to counter. Especially when hiding behind the argument that all we’re really doing is optimising for the same set of KPIs we always have; we’re just getting better at it.

So while I think designers should be doing more testing than they currently are, I also think it’s important to know when not to automate decision making. I’d encourage you to start playing with the technology, and maybe even consider moving somewhere with a mature testing culture the next time you switch jobs.

But it may also be worth asking the interviewers the following question:

“When was the last time you decided not to ship a change that A/B testing demonstrated performed better, and why?”

That way you can at least tell whether the dog is wagging the tail, or the tail wagging the dog.
