Objective Bayesian Two Sample Hypothesis Testing for Online Controlled Experiments

As A/B testing gains wider adoption in industry, more people are beginning to recognize the limitations of traditional frequentist null hypothesis statistical testing (NHST). The large number of search results for the query "Bayesian A/B testing" shows how quickly interest in the Bayesian perspective is growing. In recent years, some have also argued that Bayesian A/B testing should replace frequentist NHST and is strictly superior in all respects. Our goal here is to dispel that myth by examining both the advantages and the issues of Bayesian methods. In particular, we propose an objective Bayesian A/B testing framework with which we hope to bring together the best of Bayesian and frequentist methods. Unlike traditional methods, this method requires historical A/B test data from which a prior can be learned objectively. We have successfully applied this method to Bing, using thousands of experiments to establish the priors.
