A/B Testing

II. Early Stopping

In part I, we covered the basics of A/B testing in the context of randomized controlled trials (RCTs). Head there if you need a refresher!

For many businesses, time is money, so when A/B tests are run for experimental campaigns, companies want results as soon as possible. Tech companies in particular often have access to data almost immediately after a user interacts with a webpage or product. That combination of urgency and instant data makes it very tempting for anyone running an A/B test to cut costs by peeking at the results at various intervals as data arrives, and then deciding whether to keep the test running or stop it based on the current p-value. This post explores the dangers of such behavior.

To better understand the pitfalls of early stopping, instead of considering A/B tests, we will take a look at A/A tests. In A/A testing, we test the control against itself. Why would this be useful? Imagine you were signing up for a new A/B testing platform. One easy way to check its reliability would be to test something you know against itself, because the answer you're supposed to get back is clear. Here we will begin with a normal distribution whose mean we know is 0 and whose standard deviation is 1, and test whether trials drawn from this distribution have that same mean.

The following code runs a simulation with n_trials experiments. Each trial contains a number of samples set by the variable samples_per_trial, and each sample is drawn from our normal distribution. Since every experiment is drawn from that distribution, its mean should not be statistically different from the distribution's mean. In other words, our null hypothesis is that the mean of each experiment equals the mean of our normal distribution (0), and by construction the null hypothesis is true.

However, we do expect some proportion of false positives when we run tests of significance, due to randomness in the sampling process. How many is determined by our significance level, alpha. If p <= alpha, we consider the result statistically significant. Here we will set alpha to .05, so we expect approximately 5% of the experiments to give us false positive results (the p-value is the probability of seeing a sample mean at least as extreme as the one observed, assuming the null hypothesis is true). In other words, approximately 5% of the experiments will look statistically significant (seemingly different from the normal distribution at alpha = .05) when, by construction, we know they are not.

The following code returns the proportion of false positives at an alpha of 5%. We will run the simulation for 10,000 trials, each trial consisting of 100 samples:

from numpy.random import normal
from scipy.stats import ttest_1samp

# Runs n_trials, and returns how many of them are statistically significant
def run_norm_sim(n_trials, samples_per_trial, alpha = .05):
	# A list of n_trials arrays, each holding samples_per_trial draws from N(0, 1)
	trials = [normal(loc=0.0, scale=1.0, size=samples_per_trial) for i in range(n_trials)]

	# Test for significance
	significance = list(map(lambda t: ttest_1samp(t, 0), trials))

	# Return True (for significant) if less than alpha, else False
	n_sig = list(map(lambda sig: sig.pvalue <= alpha, significance))
	return 1.0*sum(n_sig)/n_trials # Return ratio of false positives

print(run_norm_sim(10000, 100))

Indeed, we see a false positive rate of approximately 5%! Try it yourself with different thresholds and different numbers of trials and samples.
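One way to try several thresholds without re-running the simulation for each one is to compute the p-values once and compare them against a list of alphas. The sketch below is only an illustration: false_positive_rates and its seed parameter are hypothetical names that are not part of the code above, and it relies on ttest_1samp accepting axis=1 to test every row of a 2D array at once.

```python
import numpy as np
from scipy.stats import ttest_1samp

def false_positive_rates(alphas, n_trials=10000, samples_per_trial=100, seed=0):
    """Hypothetical helper: A/A false positive rate for several alpha thresholds."""
    rng = np.random.default_rng(seed)  # seeded so reruns give the same numbers
    # One row per trial, each drawn from N(0, 1), so the null hypothesis is true
    trials = rng.normal(loc=0.0, scale=1.0, size=(n_trials, samples_per_trial))
    # axis=1 runs one t-test per row and returns an array of p-values
    pvalues = ttest_1samp(trials, 0.0, axis=1).pvalue
    # Fraction of trials flagged significant at each threshold
    return {alpha: float(np.mean(pvalues <= alpha)) for alpha in alphas}

rates = false_positive_rates([0.01, 0.05, 0.10])
for alpha, rate in rates.items():
    print(f"alpha={alpha:.2f}  false positive rate={rate:.3f}")
```

Since the null hypothesis is true by construction, each printed rate should land close to its alpha.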

Let's recall our goal. We are trying to see the impact of peeking at the experiments early and deciding whether or not to stop the experiment based on what we see. We will perform the same experiment again, with 10,000 trials, each trial with at most 100 samples. The only difference is that now we will peek every 20 samples and stop the experiment as soon as we see statistical significance. We will check at five points: 20, 40, 60, 80, and 100 samples. Imagine this as streaming data coming in, with us peeking at these intervals as the data becomes available. Run the following to get the rate of false positives:

# Run early stopping experiment
def early_stop(n_trials, max_per_trial, peek_interval, alpha = .05):

	# Peek every 'peek_interval' samples and stop at the first significant result
	def significance_check(arr, peek_interval = peek_interval, alpha = alpha):
		# len(arr) + 1 so that the final peek at max_per_trial is included
		for i in range(peek_interval, len(arr) + 1, peek_interval):
			if ttest_1samp(arr[:i], 0).pvalue <= alpha:
				return True
		return False

	trials = [normal(0.0, 1.0, max_per_trial) for i in range(n_trials)]
	n_sig = list(map(significance_check, trials))
	return 1.0*sum(n_sig)/n_trials

print(early_stop(10000, 100, 20))
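
As a rough yardstick for what to expect from this simulation: if the looks were independent tests, the chance of at least one false positive across k looks would be 1 - (1 - alpha)^k. The looks here reuse the same accumulating data, so they are correlated, and the simulated rate should be expected to fall somewhere between alpha and that figure. The helper name below is hypothetical, and independence is a simplifying assumption made only for illustration.

```python
def independent_looks_rate(alpha, n_looks):
    """Chance of at least one false positive across n_looks independent tests.

    Assumption (for illustration only): the looks are independent. Peeks at
    accumulating data are correlated, so this overstates the inflation.
    """
    return 1.0 - (1.0 - alpha) ** n_looks

# Compare one look, five looks, and twenty looks at alpha = .05
for k in (1, 5, 20):
    print(k, round(independent_looks_rate(0.05, k), 4))
```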

Now we are up to about 13% false positives! That's a very big jump just for peeking 5 times over the duration of this experiment! As you can imagine, if you peek more often, the false positive rate increases further. To see this experimentally, we will still peek every 20 samples, but we will vary the maximum number of samples from 400 up to just under 50,000 (incrementing by 400). To lessen computation time, instead of running 10k trials for each setting, we will only count false positives over 100 trials. The variance of the false positive rate will be higher because there are fewer trials to average over, but we should still get a sense of the behavior as the maximum sample size grows.

There are a lot of layers here, so let me clarify. The first run will have 100 trials, each trial consisting of 400 samples. Every 20th sample, we will peek at the p-value and stop if it is statistically significant. We never look beyond 400 samples. In the next run every variable is the same, but the maximum is set to 800 samples (so the number of times we peek increases linearly with the maximum). Here's the code. It will take a while to run through the simulation, maybe 15 minutes depending on computer speed. To run it faster, lower the maximum samples from 50k to 10k, or increase the peek interval from 20 to 100:

import seaborn as sns
import matplotlib.pyplot as plt

# We are going to see how increasing the maximum samples per trial
# impacts the false positives from early stopping
def early_stop_vary():
	x = [i for i in range(400, 50000, 400)]
	y = list(map(lambda max_per_trial: early_stop(100, max_per_trial, 20), x))
	# Pass x and y by keyword; recent seaborn versions require it
	sns.scatterplot(x=x, y=y)
	plt.xlabel('Max Samples per Trial')
	plt.ylabel('Statistically Significant Ratio (p<=.05)')

# early_stop_vary()
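
The simulation above is slow because each peek recomputes the t-test over all samples seen so far, so the work per trial grows quadratically with the maximum sample size. A possible speed-up, sketched here, is to take cumulative sums once per trial and rebuild the one-sample t statistic at every peek from those running sums. early_stop_fast and its seed argument are hypothetical names, not part of the original code, and the sketch assumes a peek interval of at least 2 so that the sample variance is defined.

```python
import numpy as np
from scipy.stats import t as student_t

def early_stop_fast(n_trials, max_per_trial, peek_interval, alpha=0.05, seed=0):
    """Hypothetical vectorized variant of early_stop built on running sums."""
    rng = np.random.default_rng(seed)
    data = rng.normal(0.0, 1.0, size=(n_trials, max_per_trial))

    # Sample sizes at each peek: peek_interval, 2 * peek_interval, ...
    n = np.arange(peek_interval, max_per_trial + 1, peek_interval)
    # Running sums of x and x^2, read off at each peek (index n - 1)
    s1 = np.cumsum(data, axis=1)[:, n - 1]
    s2 = np.cumsum(data ** 2, axis=1)[:, n - 1]

    mean = s1 / n
    var = (s2 - n * mean ** 2) / (n - 1)   # unbiased sample variance
    t_stat = mean / np.sqrt(var / n)       # one-sample t statistic against mean 0
    pvalues = 2.0 * student_t.sf(np.abs(t_stat), df=n - 1)  # two-sided p-value

    # A trial counts as a false positive if any peek is significant
    return float(np.mean((pvalues <= alpha).any(axis=1)))

print(early_stop_fast(10000, 100, 20))
```

The trade-off is memory: all samples for all trials are held in one array, which is fine for 100 trials of 50k samples but would need chunking for much larger runs.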


The resulting plot shows that at 50k max samples, the false positive rate is over 60%! We are heavily penalized for peeking and basing our decision on what we see. For contrast, we will look at the false positive rate for experiments with the same maximum sample sizes, but without peeking. Here we will increase the number of trials per run to 1000, since the computation is much cheaper.

# In contrast, we will perform the same experiment above, increasing
# max samples per trial, but with no early stopping
def normal_trial_vary():
	x = [i for i in range(400, 50000, 400)]
	y = list(map(lambda max_per_trial: run_norm_sim(1000, max_per_trial), x))
	# Pass x and y by keyword; recent seaborn versions require it
	sns.regplot(x=x, y=y)
	plt.xlabel('Max Samples per Trial')
	plt.ylabel('Statistically Significant Ratio (p<=.05)')

# normal_trial_vary()

As expected, the linear regression line stays flat at roughly 5% false positives across the range of maximum sample sizes (the shaded band is the 95% confidence interval of the fit). This matches what we would expect from how we set up this A/A test.

Is there a more intuitive way to understand why early stopping inflates the false positive rate? Let's take a look at the trajectory of the p-value for 20 different trials. For each run, we will record the p-value every 50 samples to simulate peeking at those intervals, up to a maximum of 600 samples. Since we have 20 runs, we expect that on average, at the end of 600 samples, 1 of those runs will turn out to be a false positive (5% of the total trials).

import itertools

# Let's trace the p-value at incremental intervals to see why
# early stopping yields terrible results
def trial_over_time(max_samples, peek_interval, n_trials = 10, alpha = .05):
	x = [i for i in range(peek_interval, max_samples+1, peek_interval)]
	trials = [normal(0.0, 1.0, max_samples) for i in range(n_trials)]
	pval_per_interval = []
	for trial in trials:
		pval_list = list(map(lambda interval: ttest_1samp(trial[:interval], 0).pvalue, x))
		pval_per_interval.append(pval_list) # Keep each trial's p-value trajectory

	palette = itertools.cycle(sns.color_palette("hls", 20))

	# One line per trial, drawn with matplotlib directly
	for y in pval_per_interval:
		plt.plot(x, y, marker='o', markersize=3, alpha=.5, color=next(palette))

	# Horizontal line marking the significance threshold
	plt.axhline(alpha, color='black', linestyle='--')
	plt.xlabel('nth Sample')
	plt.ylabel('p-value')

trial_over_time(600, 50, 20)

The p-value of each run at various intervals is plotted on the y-axis. You can see a horizontal line at the 5% mark; the region beneath the line is where the experiment, up to that sample count, is considered statistically significant. As we expect, if we look at the 600th sample mark, one experiment (in orange) is below the significance threshold. On the other hand, we see up to five different colors cross the alpha threshold at some point, which means that for those runs the p-value was statistically significant when tested at one of the earlier intervals! Had we treated a single crossing as statistical significance, we would have increased our false positive rate! Our test makes guarantees about long-run frequencies; it says nothing about what will happen in between samples. For example, if you knew a C-average student and an A-average student, and on a single test the C student scored higher than the A student, you would probably not want to conclude from that one observation that the C-average student is the better performer. Likewise, stopping at the first significant peek leads to a high false positive rate.
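
To put numbers on that comparison, the sketch below counts, over one set of simulated trials, how many are significant at the final look and how many are significant at any look along the way. count_crossings and its seed argument are hypothetical names introduced for illustration; the exact counts depend on the random draw.

```python
import numpy as np
from scipy.stats import ttest_1samp

def count_crossings(max_samples, peek_interval, n_trials=20, alpha=0.05, seed=0):
    """Hypothetical helper: (significant at final look, significant at any look)."""
    rng = np.random.default_rng(seed)
    peeks = range(peek_interval, max_samples + 1, peek_interval)
    final_sig = 0
    ever_sig = 0
    for _ in range(n_trials):
        trial = rng.normal(0.0, 1.0, max_samples)
        # p-value trajectory for this trial, one entry per peek
        pvals = [ttest_1samp(trial[:n], 0.0).pvalue for n in peeks]
        final_sig += int(pvals[-1] <= alpha)   # only the last look counts
        ever_sig += int(min(pvals) <= alpha)   # any crossing counts
    return final_sig, ever_sig

final_sig, ever_sig = count_crossings(600, 50, n_trials=20)
print(f"significant at the final look: {final_sig} of 20")
print(f"significant at some look:      {ever_sig} of 20")
```

The second count can never be smaller than the first, because the final look is itself one of the peeks.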


In the third and final post on A/B testing, we will explore Bayesian methods for A/B testing, which can be more robust to this kind of error (though not immune). In addition, we will look at other ways in which Bayesian methods help to decide between variants when frequentist methods are inconclusive.