On Post-selection Inference in A/B Testing

When interpreting A/B tests, we typically focus only on the statistically significant results and take them at face value. This practice, termed post-selection inference in the statistical literature, can bias point estimates and distort uncertainty quantification, and therefore hinders trustworthy decision making in A/B testing. To address this issue, we explore two seemingly unrelated paths, one based on supervised machine learning and the other on empirical Bayes, and propose post-selection inferential approaches that combine the strengths of both. Through large-scale simulated and empirical examples, we demonstrate that the proposed methodologies outperform existing ones in both reducing post-selection bias and improving confidence interval coverage rates, and we discuss how they can be conveniently adapted to real-life scenarios.
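The effect described above can be illustrated with a small simulation. The sketch below is not the method proposed in the paper. It assumes a simple normal-normal model (true effects drawn from N(0, tau^2), each observed with known noise sigma), keeps only the tests that are significant at the 5% level, and compares the naive estimate with a basic parametric empirical Bayes shrinkage estimate. The function name `simulate`, its parameters, and the output keys are all hypothetical choices made for illustration.

```python
import random
import statistics


def simulate(n_tests=50_000, tau=1.0, sigma=1.0, z=1.96, seed=0):
    """Compare naive and shrinkage estimates on significant tests only.

    Assumed model (illustrative, not taken from the paper):
    mu_i ~ N(0, tau^2) and x_i | mu_i ~ N(mu_i, sigma^2), sigma known.
    """
    rng = random.Random(seed)
    mu = [rng.gauss(0.0, tau) for _ in range(n_tests)]
    x = [rng.gauss(m, sigma) for m in mu]

    # Parametric empirical Bayes: the marginal variance of x is
    # tau^2 + sigma^2, so tau^2 is estimated from the data itself.
    tau2_hat = max(statistics.pvariance(x) - sigma**2, 0.0)
    shrink = tau2_hat / (tau2_hat + sigma**2)   # posterior mean is shrink * x
    post_sd = (shrink * sigma**2) ** 0.5        # posterior standard deviation

    # Post-selection step: keep only the "statistically significant" tests.
    selected = [(xi, mi) for xi, mi in zip(x, mu) if abs(xi) > z * sigma]

    def sign(v):
        return 1.0 if v > 0 else -1.0

    # Signed error: positive values mean the effect size is exaggerated.
    naive_bias = statistics.fmean(sign(xi) * (xi - mi) for xi, mi in selected)
    eb_bias = statistics.fmean(
        sign(xi) * (shrink * xi - mi) for xi, mi in selected
    )

    # Coverage of nominal 95% intervals among the selected tests.
    naive_cov = statistics.fmean(
        abs(xi - mi) <= z * sigma for xi, mi in selected
    )
    eb_cov = statistics.fmean(
        abs(shrink * xi - mi) <= z * post_sd for xi, mi in selected
    )

    return {
        "n_selected": len(selected),
        "naive_bias": naive_bias,
        "eb_bias": eb_bias,
        "naive_coverage": naive_cov,
        "eb_coverage": eb_cov,
    }


result = simulate()
for key, value in result.items():
    print(f"{key}: {value:.3f}" if isinstance(value, float) else f"{key}: {value}")
```

Under these assumptions the naive estimates of the selected tests overstate the true effects and their nominal 95% intervals cover less often than advertised, while the shrinkage estimates are much closer to unbiased. Real A/B test data rarely follow such a clean prior, which is the gap the approaches in the paper are meant to address.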
