You know the world has come a long way when someone has to espouse the heresy of **not** caring about statistical significance.

This is not an argument against A/B testing, but rather about how we use A/B test results to make business decisions. Instead of statistical significance, let’s make decisions based on **expected value**, i.e. *$benefit × probability − $cost*.

A little background on statistical significance, or “p < 0.05.” Say you have just deployed an A/B test, comparing the existing red (control) vs. a new green (test) “BUY NOW!” button. Two weeks later you see that the green-button variant is making $0.02 more per visitor than the red-button variant. You run some stats, see that the p-value is less than 0.05, and are ready to declare the results “significant.” “Significant” here means that there’s more than a 95% chance that the color made a difference, or, truer to the statistics, that there’s less than a 5% chance that the $0.02 difference is simply due to random fluctuations.

That last sentence there is probably too long to fit in anyone’s attention span. Let me break it down a little. The problem here is that you need to prove, or disprove, that the difference between the two variants is real — “real” meaning generalizable to the larger audience outside of the test traffic. The philosophy of science (confirmation is indefinite while negation is absolute — a thousand white swans can’t prove that all swans are white, but one black swan can disprove that all swans are white) and practicality both require that people set out to prove that the difference is real by disproving the logical opposite, i.e. there is no real difference. Statistics allows us to figure out that if we assume there is no difference between the red- and green-button variants, the probability of observing a $0.02 or larger difference by random chance is less than 0.05, i.e. p < 0.05. That is pretty unlikely. So we accept the alternative assumption, that the difference is real.
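As a concrete sketch of how that p-value comes about: below is a two-sided z-test on simulated per-visitor revenue. The dollar amounts and the $0.50 standard deviation are made-up numbers for illustration, not from any real test.

```python
import math
import numpy as np

rng = np.random.default_rng(42)
n = 100_000  # visitors per variant over the two weeks

# Simulated revenue per visitor; the green button's true mean is $0.02 higher.
red = rng.normal(loc=1.00, scale=0.50, size=n)
green = rng.normal(loc=1.02, scale=0.50, size=n)

# Two-sided z-test on the difference in means: p_value is the probability of
# seeing a difference at least this large if there were no real difference.
diff = green.mean() - red.mean()
se = math.sqrt(red.var(ddof=1) / n + green.var(ddof=1) / n)
z = diff / se
p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
print(f"observed lift: ${diff:.4f} per visitor, z = {z:.2f}, p = {p_value:.2g}")
```

With samples this large, a true $0.02 lift produces a tiny p-value; with a much smaller sample the very same lift would look “nonsignificant.”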

What if you have a p-value of 0.4, i.e. a 40% chance of getting a $0.02 or larger difference simply by random fluctuations? Well, you may be asked to leave the test running for longer until it reaches “significance,” which may never happen if the two variants are really not that different, or you may be told to scrap the test.

Is that really the best decision for a business? If we start out with the alternative assumption that there is some difference between the variants, 60% of the time we will make more money with the test variant and 40% of the time we will lose money compared to the control. The net gain in extra-money-making probability is 20%. The expected size of the gain is $0.02 per visitor. Say we have 100K visitors each day; that’s $0.02 × 100,000 × 0.2 = $400 in expected extra revenue. It doesn’t cost us anything extra to implement the green button instead of the red one. Of course we should go for the green button.
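Spelled out as a short sketch (using the reading of the p-value argued for above):

```python
lift_per_visitor = 0.02   # observed extra revenue per visitor, in dollars
visitors_per_day = 100_000
p_value = 0.40            # chance of a lift this large under "no real difference"

# Under this interpretation: 60% chance the test variant is better, 40% chance
# it is worse, for a net 20% gain in extra-money-making probability.
net_probability_gain = (1 - p_value) - p_value

expected_daily_gain = lift_per_visitor * visitors_per_day * net_probability_gain
print(f"expected extra revenue per day: ${expected_daily_gain:.0f}")
```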

If we go back to the option of letting the test run for longer before making a decision, the upside is that we will have a more accurate estimate of the impact of the test variant. The downside is that, if one variant has $400 expected extra revenue each day, that’s $400 × (1 − traffic_in_test_variant%) extra dollars we are not taking in each day.

Now suppose you are so diligent that you keep rolling out A/B tests, this time testing a fancy search ranking algorithm. Two weeks later you see a $0.10 increase in dollars spent per visitor for the test variant compared to the control (i.e. the existing search ranking algorithm). If the increase is real, with 100K visitors each day, that’s $0.10 × 100,000 = $10,000 in extra revenue each day. Now, let’s add a twist: you need five extra servers to support that fancy algorithm in production, and the servers cost $10,000 each to buy, and another $10,000 per year to run. You want to make sure it’s worth the investment. Your stats tell you that you currently have a p-value of 0.3, which most people would interpret as a “nonsignificant” result. But a p-value of 0.3 means that with the new ranking algorithm the net gain in extra-money-making probability is 0.7 − 0.3 = 0.4. With the expected size of the gain being $0.10 per visitor, the expected extra revenue per year is $0.10 × 100,000 × 0.4 × 365 = $1.46M dollars. The rational thing to do is, of course, to release it.
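The same arithmetic for the ranking test. The cost side is a sketch: whether the $10,000/year running cost is per server or in total isn’t pinned down above, so the code assumes per server.

```python
lift_per_visitor = 0.10
visitors_per_day = 100_000
p_value = 0.30
net_probability_gain = (1 - p_value) - p_value  # 0.4

expected_yearly_gain = (lift_per_visitor * visitors_per_day
                        * net_probability_gain * 365)

# Assumed cost model: five servers, $10K each to buy plus $10K each per year.
first_year_cost = 5 * (10_000 + 10_000)

print(f"expected extra revenue per year: ${expected_yearly_gain:,.0f}")
print(f"first-year server cost: ${first_year_cost:,}")
```

Even under this p = 0.3 discounting, the expected gain dwarfs the server cost by an order of magnitude.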

Now, the $0.10 increase is the expected amount of the increase; there is *risk* associated with it. In addition, humans are not rational decision makers, so a better theory would use *expected utility* and include risk aversion in the calculation, but that is beyond the point of this article. This article is about using statistical significance vs. expected value for making decisions.

Statistical significance is that magical point on the probability curve beyond which we accept a difference as real and beneath which we treat the difference as negligible. The problem is, as the above examples have demonstrated, probabilities fall on a continuous curve. And even when there is no real difference at all, a significance level of p = 0.05 means that 1 in 20 A/B comparisons will give you a statistically significant result simply by random chance. If you have 20 test variants in the same test, just by chance alone about 1 of these 20 variants will produce “statistically significant” results (unless you adjust the significance level for the number of variants).
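A quick check of that multiple-comparisons point. Under the null hypothesis a p-value is uniformly distributed, so with 20 truly identical variants the chance of at least one false “significant” result is 1 − 0.95²⁰ ≈ 64%:

```python
import random

random.seed(3)
trials = 10_000

# For each simulated experiment, draw 20 null p-values (uniform on [0, 1])
# and ask whether any of them dips below the 0.05 threshold.
at_least_one = sum(
    any(random.random() < 0.05 for _ in range(20)) for _ in range(trials)
) / trials
print(f"P(at least one false positive) ≈ {at_least_one:.2f}")
```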

The normal distribution (or whatever distribution you use to get the probabilities) does not come with a marker of statistical significance, much like the earth does not come with latitudinal or longitudinal lines. Those lines are added essentially arbitrarily to help you navigate, but they are not the essence of the thing you are dealing with.

The essence of the thing you are dealing with in A/B tests is *probability*. So let’s go back to the basics and make use of probabilities. Talk about benefit and probability and cost, not statistical significance. It’s no more than a line in the sand.

–

Notes:

1. The above examples assumed that the A/B tests themselves were sound and that the observed differences were stable. To estimate the point at which the data become stable, use power analysis to calculate the required sample size.
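A sketch of that power analysis, using the standard normal-approximation sample-size formula for a two-sided test of a difference in means; the $0.02 lift and $0.50 standard deviation are assumed numbers carried over from the button example.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(delta, sigma, alpha=0.05, power=0.80):
    """Visitors needed per variant to detect a true difference `delta`
    in the mean, at significance level `alpha` with the given power,
    using a two-sided z-test approximation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) * sigma / delta) ** 2)

# To reliably detect a $0.02 lift when per-visitor revenue has a $0.50 sd:
n = sample_size_per_variant(delta=0.02, sigma=0.50)
print(f"needed per variant: {n:,} visitors")
```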

2. Typical hypothesis testing procedure: to investigate whether an observed difference generalizes outside of the test, we set up two competing hypotheses. The null hypothesis assumes that there is no difference between the two means, i.e. the two samples (e.g. two A/B test variants) are drawn from the same population, so their means fall on the same sampling distribution. The alternative hypothesis assumes that the two samples are drawn from different populations, i.e. their means fall on two different sampling distributions. We start out assuming the null hypothesis to be true and that the mean of the control variant represents the true mean of the population. We then calculate the probability of getting the test variant’s mean under this assumption. If that probability is less than some small number, for example p < 0.05, we reject the null hypothesis and accept the alternative hypothesis.
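One hands-on way to run that procedure is a permutation test: if the null hypothesis is true, the variant labels are interchangeable, so we can reshuffle them and see how often a shuffled difference is at least as large as the observed one. The data here are simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000  # visitors per variant (made-up sample for the sketch)
control = rng.normal(1.00, 0.5, size=n)
test = rng.normal(1.02, 0.5, size=n)

observed = test.mean() - control.mean()
pooled = np.concatenate([control, test])

n_perm = 5_000
hits = 0
for _ in range(n_perm):
    rng.shuffle(pooled)  # relabel the visitors, as the null hypothesis allows
    if abs(pooled[n:].mean() - pooled[:n].mean()) >= abs(observed):
        hits += 1

p_value = hits / n_perm  # two-sided permutation p-value
print(f"observed difference: {observed:.4f}, permutation p = {p_value:.3f}")
```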

3. Significance levels are very much a convention and vary across disciplines and situations. Sometimes people use 0.01 or 0.001 instead of 0.05 as the significance level. As we all learned from the Higgs boson discovery, physicists require five sigmas (which translates to a p-value of about 0.0000003) before something is officially accepted as a “discovery.” Traditional significance levels are biased strongly against false positives (claiming an effect to be true when it’s actually false) because of the severe cost of championing a false new theory or investing in a false new drug.

–

Comments:

p-values don’t mean what you think they mean.

Two issues.

First is your conclusion: “The essence of the thing you are dealing with in A/B tests is probability. So let’s go back to the basics and make use of probabilities. Talk about benefit and probability and cost, not statistical significance. It’s no more than a line in the sand.”

Statistical significance IS probability. The Stat Sig tells you what the probability is that your results are simply due to chance, OR that the results you are seeing are a real and measurable effect. This is _pretty_ important to know when making a decision based on statistical analysis.

Secondly, you make an inferential mistake in one of your examples:

“Your stats tell you that you currently have a p-value of 0.3, which most people would interpret as a “nonsignificant” result. But a p-value of 0.3 means that with the new ranking algorithm the net gain in extra-money-making probability is 0.7 − 0.3 = 0.4.”

A significance (p-value) of 0.3 indicates that there is a 30% chance that the results you are observing are simply due to chance. You can’t take a 30% chance of error and calculate how much money this will net for the business. Your calculation, “the expected extra revenue per year is $0.10 × 100,000 × 0.4 × 365 = $1.46M dollars,” is NOT how you use significance.

Your calculation should be this:

We expect to see this effect when we change our search algorithm: With the expected size of the gain being $0.10 per visitor, the expected extra revenue per year is $0.10 × 100,000 × 365 = $3.65M dollars. However there is a 30% probability that the results we observed were simply due to chance, and if this 30% turns out to be the reality then the actual net benefit / cost is unpredictable with the current model.

Now most people consider 30% to be an unacceptable risk. Which is why the 5% chance of risk (or conversely 95% confidence) is a widely accepted standard.

“Statistical significance is that magical point on the probability curve beyond which we accept a difference as real and beneath which we treat the difference as negligible. The problem is, as the above examples have demonstrated, probabilities fall on a continuous curve.”

Your calculation with expected extra revenue of $3.65M dollars would be correct if there were a real difference between the A/B variants. However, whether there is any real difference between the A/B variants is unknown. We only know the probability of getting the observed or larger difference under the null hypothesis. Imagine there were 100 parallel universes, in each of which we had run an A/B test like this and gotten the same results. If we deployed the new algorithm in all these universes, 30 of them would break even or lose money and 70 of them would break even or make more money. Overall we’d still come out ahead, with net gains in 40 universes. Averaging across all these universes, the average gain would be $1.46M dollars.
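A toy Monte Carlo version of the universes argument, under a deliberately symmetric simplification I am assuming for the sketch: in each universe the new algorithm either delivers the full $3.65M lift (probability 0.7) or costs that same amount (probability 0.3).

```python
import random

random.seed(1)
full_lift = 3_650_000  # $0.10 x 100,000 visitors x 365 days

# Simulate many universes: 70% gain the full lift, 30% lose it.
gains = [full_lift if random.random() < 0.7 else -full_lift
         for _ in range(100_000)]
average_gain = sum(gains) / len(gains)
print(f"average gain across universes: ${average_gain:,.0f}")
```

The analytic average under this simplification is $3.65M × (0.7 − 0.3) = $1.46M, which is where the number above comes from.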

If the p is low, reject the ho!

I should have thought of this example earlier — suppose there are two very comparable retailers who send weekly broadcast emails to customers. Every week they set up a subject line test with 20% of the customers before sending the email out to everyone. For the subject line tests, half of the test customers get the traditional format subject line and half get a fancy new write up. Retailer A decides that they will go with the new write up only if it has “significantly” higher conversion rate than the traditional subject line. Retailer B just goes with whichever version that has the higher conversion rate in the test, paying absolutely no mind to sample size or variance and what not. Which retailer do you think will come out ahead in the long run?
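Here is a rough simulation of that thought experiment, with assumed numbers: 100K customers, a 5% baseline conversion rate, and the new subject line’s true weekly lift drawn from a normal distribution centered at zero with a 0.5-point standard deviation.

```python
import numpy as np

rng = np.random.default_rng(7)
weeks = 20_000
base = 0.05
effects = rng.normal(0.0, 0.005, size=weeks)   # true weekly lift of the new line
new_rate = np.clip(base + effects, 0.0, 1.0)

n_test = 10_000   # per test arm: 20% of 100K customers, split in half
n_roll = 80_000   # remaining customers get whichever version is chosen

old_conv = rng.binomial(n_test, base, size=weeks) / n_test
new_conv = rng.binomial(n_test, new_rate) / n_test

# Two-proportion z-test for retailer A's "significant win" rule.
pool = (old_conv + new_conv) / 2
se = np.sqrt(2 * pool * (1 - pool) / n_test)
z = (new_conv - old_conv) / se
a_ships_new = z > 1.96           # significant at p < 0.05 AND new is higher
b_ships_new = new_conv > old_conv  # simply higher in the test

# Expected rollout conversions, using the true rates of the shipped version.
a_total = np.where(a_ships_new, new_rate, base).sum() * n_roll
b_total = np.where(b_ships_new, new_rate, base).sum() * n_roll
print(f"retailer A rollout conversions: {a_total:,.0f}")
print(f"retailer B rollout conversions: {b_total:,.0f}")
```

Under these assumptions retailer B comes out ahead: with a prior centered at zero and no cost to switching, shipping whichever version looked better is the expected-value-maximizing move, significance or not.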

I applaud your skepticism of science. I took a few philosophy classes and sometimes fall into “just because all emeralds are green…” sorts of discussions with engineer-types. Got a laugh out of your 4th paragraph, in which you apologized for the previous long sentence, and then promised a simpler breakdown, which turned out to be an epic run-on sentence.