by Manojit Nandi on August 18th, 2015
In this post, I discuss a method for A/B testing using BetaBinomial Hierarchical models to correct for a common pitfall when testing multiple hypotheses. I will compare it to the classical method of using Bernoulli models for pvalue, and cover other advantages hierarchical models have over the classical model. My Python code is available on Domino.
Introduction
Imagine the following scenario: You work for a company that gets most of its online traffic through ads. Your current ads have a 3% click rate, and your boss decides that's not good enough. The marketing team comes up with 26 new ad designs, and as the company's data scientist, it's your job to determine if any of these new ads have a higher click rate than the current ad.
You set up an online experiment where internet users are shown one of the 27 possible ads (the current ad or one of the 26 new designs). After two weeks, you collect the data on each ad: How many users saw it, and how many times it was clicked on.
Time to run some statistical tests! New design A vs current design? No statistically significant difference. New design B vs current design? No statistically significant difference. You keep running test and continue getting not significant results. Just as you are about lose hope, new design Z v. current design.... statistically significant difference at the alpha = 0.05 level!
You tell your boss you've found a design that has a higher click rate than the current design, and your company deploys it in production. However, after two months of collecting statistics on the new design, it seems the new design has a click rate of 3%. What went wrong?
When performing A/B testing, data scientists often fall into the common pitfall of failing to correct to for multiple testing. Testing at alpha = 0.05 means your statistical test yielding a result as extreme or more extreme by random chance (assuming a given null hypothesis is true) occurs with probability 0.05. If you run 26 statistical tests, then an upper bound on the expected number of false positives is 26*0.05 = 1.3. This means in our above scenario, our data scientist can expect to have at least one false positive result, and unfortunately, the false positive result is the one she reported to her boss.
The Problem
Let's say we are designing Domino's website, and we're testing five different layouts for the landing page. When a user clicks on our site, she sees one of the five landing page layouts. From there, the user can decide whether or not she wants to create an account on Domino.
Website  Clicks  SignUps  Empirical Rate  True Rate 

A  1055  28  0.027  0.032 
B  1057  45  0.043  0.041 
C  1065  69  0.065  0.058 
D  1039  58  0.056  0.047 
E  1046  60  0.057  0.051 
As a disclaimer, this is simulated data that I created to mimic a real A/B testing data. The number of signups for each website was generated by generating a number from a Binomial distribution with n = Clicks and p = True Rate. The empirical rate represents Signups/Clicks.
Preliminary: Bayesian Statistics
There are two wellknown branches of statistics: Frequentist statistics and Bayesian statistics. These two branches have plenty of differences, but we're going to focus on one key difference:
 In frequentist statistics, we assume the parameter(s) of interest are fixed constants. We focus on computing the likelihood Prob(Data  Parameter), the probability we see the observed set of data points given the parameter of interest.
 In Bayesian statistics, we having uncertainty surrounding our parameter(s) of interest, and we mathematically capture our priori uncertainty about these parameters in a prior distribution, formally represented as Prob(Parameter). We focus on computing the posterior distribution Prob(Parameter  Data), representing our posteriori uncertainty surrounding the parameter of interest after we have observed data.
Thanks to Bayes's Theorem, we know that the posterior Prob(Parameter  Data) is proportional to Prob(Data  Parameter)·Prob(Parameter), the likelihood multiplied by the prior distribution.
In our problem, the parameters of interest are the true rates at which people signup on each website, and the observed data is the number of clicks and signups each website gets. Because we are performing Bayesian statistics, we are interested in modeling the posterior distribution:
Prob(True SignUp Rate Data) ~ Prob(DataTrue SignUp Rate)·Prob(True SignUp Rate).
Classic Bernoulli Model
To build Bayesian models in Python, we'll be using the Bayesian schocastic modeling library PyMC.
import pymc
Cam DavidsonPilon has written a great book on Bayesian models in PyMC that I recommend to anyone who is interested in learning Bayesian statistics or how to program Bayesian models in Python. The PyMC code in this section is based on A/B Testing example found in his book.
Before we look at BetaBinomial Hierarchical model method, let's first look at how we would perform A/B Testing in the standard two website case with Bernoulli models. Recall from above, website A had 1055 clicks and 27 signups, and website B had 1057 clicks and 45 signups.
# Website A had 1055 clicks and 28 signups values_A = np.hstack(([0]*(105528),[1]*28)) # Website B had 1057 clicks and 45 signups values_B = np.hstack(([0]*(105745),[1]*45))
Now, we can model each possible signup as a Bernoulli event. Recall the Bernoulli distribution reflects the outcome of a coin flip. With some probability p, the coin flips head and with probability 1p, the coin flips tails. The intuition behind this model is as follows: A user visits the website. The user flips a coin. If coin lands head, the user signs up.
Now, let's say each website has its own coin. Website A has a coin that lands heads with probability p_{A}, and Website B has a coin that lands heads with probability p_{B}. We don't know either probabilities, but we want to determine if p_{A} < p_{B} or if the reverse is true (There is also the possibility that p_{A} = p_{B}).
Since we have no information or bias about the true values of p_{A} or p_{B}, we will draw these two parameters from a Uniform distribution. In addition, I've created a delta function to create the posterior distribution for the difference of the two distributions.
# Create a uniform prior for the probabilities p_a and p_b p_A = pymc.Uniform('p_A', 0, 1) p_B = pymc.Uniform('p_B', 0, 1) # Creates a posterior distribution of B  A @pymc.deterministic def delta(p_A = p_A, p_B = p_B): return p_B  p_A
Next, we will create an observations variable for each website that incorporates the signup data for each website. We then use pymc to run a MCMC (Markov Chain Monte Carlo) to sample points from each website's posterior distribution. For the uninitiated, MCMC is a class of algorithms for sampling from a desired distribution by constructing an equilibrium distribution that has the properties of the desired distribution. Because the MCMC sampler needs time to converge to the equilibrium, we throw out the first 500,000 sampled points, so we have only points sampled from a (hopefully) converged equilibrium distribution.
# Create the Bernoulli variables for the observation obs_A = pymc.Bernoulli('obs_A', p_A, value = values_A , observed = True) obs_B = pymc.Bernoulli('obs_B', p_B, value = values_B , observed = True) # Create the model and run the sampling model = pymc.Model([p_A, p_B, delta, values_A, values_B]) mcmc = pymc.MCMC(model) # Sample 1,000,000 million points and throw out the first 500,000 mcmc.sample(1000000, 500000)
Now, let' examine the posterior of the delta distribution (Remember, this is the posterior of Website B  posterior of Website A).
delta_distribution = mcmc.trace('delta')[:] sns.kdeplot(delta_distribution, shade = True) plt.axvline(0.00, color = 'black')
The black line is at x = 0, representing where the difference between the two distributions is 0. From inspection, we see that most of the distribution's mass is to the right of the black line. This means most of the points sampled from B's distribution are larger than those sampled from A's distribution.
To get more quantitative results, we can compute the probability that website A gets more signups than website B.
print "Probability that website A gets MORE signups than site B: %0.3f" % (delta_distribution < 0).mean() print "Probability that website A gets LESS signups than site B: %0.3f" % (delta_distribution > 0).mean()
Probability that website A gets MORE signups than site B: 0.023
Probability that website A gets LESS signups than site B: 0.977
For these two websites, we see that website B outperforms website A. This worked well for two websites, but if we tried this for all pairs of our five websites, we run the risk of getting a "false positive problem" due to the multiple testing problem. There are 10 possible pairs, so assume we test all possible pairs independently at an alpha = 0.05 level. For each test, we have a 95% chance of not getting a false positive, so the probability that all the tests do not yield a false positive is (0.95)^{10}, which is roughly equal to 0.60. This means the probability of getting at least one false positive result is about 0.40 or 40%. If we had more websites and thus more pairs to test, the probability of getting a false positive would only increase.
As you can see, without correcting for multiple testing, we run a high risk of encountering a false positive result. In the next section, I'll show how hierarchical models implicitly correct for this multiple testing pitfall and yield more accurate results.
Beta Distribution
Before I introduce the BetaBinomial hierarchical model, I want to discuss the theoretical motivation for the Beta distribution. Let's consider the Uniform Distribution over the interval (0,1).
As you can see, it's a simple distribution. It assigns equal probability weight to all points in the domain (0,1), also known as the support of the distribution. However, what if we want a distribution over (0,1) that isn't just flat everywhere?
This is where the Beta distribution comes in! The Beta distribution can be seen as a generalization of the Uniform(0,1) as it allows us to define more general probability density functions over the interval (0,1). Using two parameters a and b, the Beta(a,b) distribution is defined with the following probability density function:
Where C is a constant to normalize the integral of the function to 1 (all probability density functions must integrate to 1). This constant is formally known as the Beta Function. By changing the values of a and b, we can change the shape and the "steepness" of the distribution, thus allowing us to easily create a wide variety of functions over the interval (0,1).
Notice in the above plot that the green line corresponding to the distribution Beta(1,1) is the same as that of Uniform(0,1). If you plug in a, b = 1 into the pdf of the Beta distribution, you get pdf(x) = B(1,1)·x^{1  1}·(1x)^{11} = B(1,1)·x^{0}·(1x)^{0} = B(1,1)·1·1 = 1, where B(1,1) is the Beta function evaluated at (1,1).
Therefore the pdf of Beta(1,1) is 1 everywhere, so it is equivalent to the pdf of the Uniform(0,1), thus proving that the Beta distribution is indeed a generalization of the Uniform(0,1).
Now, I'm sure many of you are wondering what's the big takeaway from this section, so here they are:
 The Beta Distribution is a versatile family of probability distributions over (0,1).
 This allows us to create prior distributions that incorporate some of our beliefs and thus informative priors.
In the next section, I will discuss why this is important, and how the Beta Distribution can be used for A/B testing.
Hierarchical Modeling
In this section, we will model the website signups using a Binomial Distribution. The intuition behind Binomial(n,p) is that if we flip a coin with probability p of landing heads n times, how likely is it that we see k heads for some k between 0 and n.
In the Classical Bernoulli Method section, I said we had no prior information about the true rates for each website, so we used Uniform(0,1) as our uninformative prior. In this section, we will assume each true signups rate is drawn from a Beta distribution.
Now, the number of signups, k_{i}, is modeled by Binomial(n_{i},p_{i}), and the true signup rate for each website, p_{i}, is drawn from Beta(a,b). Now, we've neglected one important question up until this point? How do we choose a and b?
We could assign a prior distribution to choose these hyperparameters, but then our prior distribution has a prior distribution and it's priors all the way down.
A better alternative is to sample a and b from the distribution
f(a,b) ~ (a+b)^{5/2} where a,b > 0.
I'm sure this function looks like math magic to many of you, but an indepth explanation of where this probability distribution comes from can be found in Andrew Gelman's book or in this tutorial on Youtube.
The general idea behind this function is that it corporates ideas about the logit of the mean, log(a/b), and the log of the "sample size", log(a+b), to define a proper probability distribution for these two parameters. If the previous sentence means nothing to you, do not worry. All it is saying is that if we use this function, we get desired mathematical properties when sampling for a and b.
Now that I've covered the theory, we can finally build our hierarchical model. The code in this section is based on Stater Stich's Use a Hierarchical Model tutorial.
The beta priors function samples a and b for us from the function defined above. The Stochastic decorator in PyMC requires we return the loglikelihood, so we will be returning the log of (a+b)^{2.5}.
@pymc.stochastic(dtype=np.float64) def beta_priors(value=[1.0, 1.0]): a, b = value if a <= 0 or b <= 0: return np.inf else: return np.log(np.power((a + b), 2.5)) a = beta_priors[0] b = beta_priors[1]
We then model the true signup rates as Beta distribution and use our observed signup data to construct the Binomial distribution. Once again use MCMC to sample 1,000,000 data points from an equilibrium distribution (throwing out the first 500,000 again).
# The hidden, true rate for each website. true_rates = pymc.Beta('true_rates', a, b, size=5) # The observed values trials = np.array([1055, 1057, 1065, 1039, 1046]) successes = np.array([28, 45, 69, 58, 60]) observed_values = pymc.Binomial('observed_values', trials, true_rates, observed=True, value=successes) model = pymc.Model([a, b, true_rates, observed_values]) mcmc = pymc.MCMC(model) # Generate 1,000,000 samples, and throw out the first 500,000 mcmc.sample(1000000, 500000)
Let's see what our five posterior distributions look like.
Now that we have all five posterior distributions, we can easily computer the difference between any two of them. In particular, let's look at the difference between Website C and Website A.
diff_CA = mcmc.trace('true_rates')[:][:,2]  mcmc.trace('true_rates')[:][:,0] sns.kdeplot(diff_CA, shade = True, label = "Difference site C  site A") plt.axvline(0.0, color = 'black')
Once again, we see that most of this difference distribution's mass is to the right of the vertical black line, meaning that most of the points sampled from C's distribution are greater than those sampled from A's distribution. If we quantify these results, we get:
print "Probability that website A gets MORE signups than website C: %0.3f" % (diff_CA < 0).mean() print "Probability that website A gets LESS signups than website C: %0.3f" % (diff_CA > 0).mean()
Probability that website A gets MORE signups than website C: 0.000
Probability that website A gets LESS signups than website C: 1.000
We see that with almost 100% certainty that website C is better than website A. For those of you skeptical of this result, keep in mind that true signup rate I chose to simulate data for website C is nearly twice of that of website A, so such a pronounced difference in the posterior distribution of website C and website A is to be expected.
In addition, thanks to the hierarchical model, we can perform multiple tests in succession without having to adjust for multiple comparisons. For example, let's revisit the difference of the posterior distribution of website B and website A.
Once again, we see most of the probability mass of this posterior distribution lies to the right of the line x = 0.00. We can quantify these results using the same method we used for the difference between website C and website A.
Probability that website A gets MORE signups than website B: 0.028
Probability that website A gets LESS signups than website B: 0.972
Again, we see results showing that website B has a higher signup rate than website A at a statistically significant level, same as the result when using the classical Bernoulli model.
It should be noted that the hierarchical model cannot overcome the limitations of data. For example, let's consider website D and website E. While these two websites have a differing true signup rate (website E being better than website D), they have virtually identical click and signup data. If we plot the two posterior distributions for this website, we see they overlap greatly.
As a result, our difference of the posterior yields a distribution centered about 0.0, and we cannot conclude that one website has a higher signup rate at a statistically significant level.
Probability that website D gets MORE signups than website E: 0.437
Probability that website D gets LESS signups than website E: 0.563
With the hierarchical model, I can test the hypothesis that website C and website A have differing true signup rates and the hypothesis that website B and website A have a differing true signup rate one after another, something we could not do with the Bernoulli model without making some adjustments to our statistical test.
Comparing the two methods
Now I have shown you two different ways to do A/B testing: One way using Bernoulli distributions, and another using BetaBinomial distributions. I motivated the BetaBinomial hierarchical model using the problem of multiple comparison testing. Let's now compare the performance of the two methods by examining the posterior distribution generated for Website A by each of the two methods. I've included a black vertical line at x = 0.32, the true rate probability I used to simulate the data.
sns.kdeplot(siteA_distribution, shade = True, label = "Bernoulli Model") sns.kdeplot(mcmc.trace('true_rates')[:][:,0], shade = True, label = "Hierachical Beta") plt.axvline(0.032, color = 'black')
As you can see, the mass of the Hierarchical BetaBinomial model is closer to the true rate that that of the Bernoulli model. The posteriors of the hierarchical model gave us a closer estimate of the true rate in this case.
Why does the Hierarchical BetaBinomial model appear to be more accurate in estimating the true rate? Well this comes down to the prior distributions used in each method. In the classical Bernoulli method, we used the Uniform(0,1) as our prior distribution. As I mentioned earlier, this is an uninformative prior as it assigns equal weight to all possible probabilities. On the other hand, the Beta prior creates a distribution that puts some of the probability mass towards the "truth" and thus we see a more a bit accurate estimate for the posterior distribution. If we slightly altered the values a and b for Beta distribution, it's possible we could see no difference between the two posteriors.
It's possible there is some other mathematical function that we could use as our prior that would result in a more accurate posterior, but we have no guarantees that the resulting posterior would be easily computable. The Beta distribution is a conjugate prior to the Binomial distribution, so we are guaranteed the resulting posterior distribution will be easy to compute (In this case, the posterior will always be a Beta distribution).
Problem of Multiple Testing
Finally, I'm going to address the question everyone has been thinking while reading this! Why do hierarchical models solve the problem of multiple testing?
Let's consider two extremes: In one extreme, we model each website having the same true signup rate. This means p_{A} = p_{B} = ... = p_{E}, and thus completely pooling all of the signup data from each website together to estimate a "mean" signup rate across all five websites. In the other extreme, we model the signup rate for each website completely separate from one another and thus focus solely on individual signup rates. The Hierarchical models is a compromise between these two extremes, called partial pooling.
In the Hierarchical model, we recognize that since all five websites are modeling the same phenomenon, we ignore some information that holds across all five websites when we model each website in isolation. By incorporating a main effect happening across all five websites while preserving individual effects, we reduce the statistical uncertainty in our models, and thus reducing the issue of multiple comparison. A more detailed explanation of partial pooling and why Hierarchical models reduce the issue of multiple comparisons can be found in this paper (Section 3.2, pages 79).
One important thing to note is that partial pooling is not unique to Bayesian statistics. For example in machine learning, MultiTask Learning deals with training classifiers to learn a group of related problems using a common representation across all the problems. When we apply a group shrinkage penalty (this mitigates the influence of the common representation on each classifier's decision), we see the partial pooling effect because each classifier's output incorporates the mitigated common representation along with individual representations for each problem.
Lastly, I believe it's worthwhile to mention the most widely known approach to deal with multiple comparisons: The Bonferroni Correction. If you recall from your introductory statistics course, the standard statistical inference procedure revolved around rejected a null hypothesis (H_{0}) if the likelihood of the observed data given the null hypothesis is too small. We quantified what "is too small" meant using an α (alpha) value to represent our threshold for statistical significance (i.e. α = 0.05).
The Bonferroni correction states that if we are going to be testing n hypotheses, then we need to lower our threshold for statistical significance to α/n. For example, if we are planning to test at the α = 0.05 level and we will be testing 10 hypotheses, then the Bonferroni correction states we should test at the 0.05/10 = 0.005 level. This means instead of rejecting our null hypothesis when we get a pvalue less than 0.05, we reject our null hypothesis when we observe a pvalue less than 0.005. There are other procedures to correct for multiple comparisons depending on the statistical test involved.
Conclusion
Data scientist use A/B testing to make meaningful impacts in their company's performance. Good A/B testing can lead to milliondollar results, but good A/B testing is also more complicated than you would think. There are many pitfalls that can lead to meaningless results.
Bayesian Hierarchical models provide an easy method for A/B testing that overcomes some of these pitfalls that plague data scientist. This blog post is inspired by Slater Stich's Statistical Advice for A/B Testing series, so anyone interested in learning more about these pitfalls can check out his posts.
If you want further reading on applied Bayesian method, I recommend this book. In addition to being a good book, it has the most adorable cover. Finally, Udacity recently released a free online course on A/B Testing.
As always, the code for this blog post can be found here. Since the MCMC sampling takes time, you can run the code on Domino's platform, so you don't have to wait in front of your computer while the MCMC sampling is running.

Jim

Gan Gao

Zilions

Márcio Fernandes

Cristian Heredia