Trustworthy Analysis of Online A/B Tests: Pitfalls, challenges and solutions

A/B tests (or randomized controlled experiments) play an integral role in the research and development cycles of technology companies. As in classic randomized experiments (e.g., clinical trials), the underlying statistical analysis of A/B tests is based on assuming that the randomization units are independent and identically distributed (i.i.d.). However, the randomization mechanisms utilized in online A/B tests can be quite complex and may render this assumption invalid. Analysis that unjustifiably relies on this assumption can yield untrustworthy results and lead to incorrect conclusions. Motivated by challenging problems arising from actual online experiments, we propose a new method of variance estimation that relies only on practically plausible assumptions, is directly applicable to a wide range of randomization mechanisms, and can be implemented easily. We examine its performance and illustrate its advantages over two commonly used methods of variance estimation on both simulated and empirical datasets. Our results lead to a deeper understanding of the conditions under which the randomization unit can be treated as i.i.d. In particular, we show that for purposes of variance estimation, the randomization unit can be approximated as i.i.d. when the individual treatment effect variation is small; however, this approximation can lead to variance under-estimation when the individual treatment effect variation is large.
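The under-estimation phenomenon can be illustrated with a minimal Monte Carlo sketch. The setup below is hypothetical and is not the paper's estimator or data: users are hashed into a fixed set of buckets, the buckets (not individual users) are randomized to treatment or control, and treatment effects vary across buckets. The "naive" variance estimate treats users as i.i.d.; comparing its average against the empirical variance of the difference-in-means across repeated randomizations exposes the gap.

```python
import numpy as np

# Hypothetical parameters for illustration only.
N_USERS = 10_000    # analysis units
N_BUCKETS = 50      # randomization units (users hashed into buckets)
N_REPS = 2_000      # Monte Carlo replications of the randomization

rng = np.random.default_rng(0)
bucket_of_user = rng.integers(0, N_BUCKETS, size=N_USERS)   # stand-in for a hash
bucket_effect = rng.normal(0.2, 0.5, size=N_BUCKETS)        # heterogeneous effects
effect = bucket_effect[bucket_of_user]                       # each user's treatment effect

estimates, naive_vars = [], []
for _ in range(N_REPS):
    baseline = rng.normal(0.0, 1.0, size=N_USERS)            # i.i.d. baseline outcomes
    treated_buckets = rng.permutation(N_BUCKETS)[: N_BUCKETS // 2]
    t = np.isin(bucket_of_user, treated_buckets)              # bucket-level assignment
    y = baseline + t * effect
    y_t, y_c = y[t], y[~t]
    estimates.append(y_t.mean() - y_c.mean())
    # i.i.d.-over-users two-sample variance estimator
    naive_vars.append(y_t.var(ddof=1) / len(y_t) + y_c.var(ddof=1) / len(y_c))

print("empirical variance of the estimate :", np.var(estimates, ddof=1))
print("average i.i.d.-based variance      :", np.mean(naive_vars))
```

In this sketch the empirical variance is many times larger than the average i.i.d.-based estimate; setting bucket_effect to a constant brings the two into rough agreement, consistent with the claim that the i.i.d. approximation is adequate only when individual treatment effect variation is small.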
