On Post-Selection Inference in A/B Tests

When many statistical inferences are conducted simultaneously, otherwise unbiased estimators become biased once we select a subset of results to report according to some criterion. This is common in A/B testing, where there are many metrics and segments to choose from and only statistically significant results are pursued. This paper proposes two approaches: one based on supervised learning, the other on empirical Bayes. We show that these two views can be unified, and we conduct a large-scale simulation and empirical study to benchmark our proposals against existing methods. Results show that our methods substantially improve both point estimation and confidence interval coverage.
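As a rough illustration of the bias described above (a toy sketch, not the paper's method), consider a normal-normal simulation: per-test estimates are unbiased marginally, but conditioning on statistical significance inflates them (the winner's curse), while simple empirical-Bayes shrinkage toward zero removes the conditional bias in this model. The prior scale `tau`, the noise level `sigma`, and the one-sided selection rule are all assumptions chosen for illustration.

```python
# Toy illustration of post-selection bias and empirical-Bayes shrinkage.
# Assumed model: theta_i ~ N(0, tau^2), z_i | theta_i ~ N(theta_i, sigma^2).
import numpy as np

rng = np.random.default_rng(0)

n = 100_000        # number of simultaneous tests (metrics x segments)
tau = 0.5          # std dev of true effects (assumed prior scale)
sigma = 1.0        # known standard error of each estimate (assumed)

theta = rng.normal(0.0, tau, size=n)   # true treatment effects
z = rng.normal(theta, sigma)           # unbiased per-test estimates

sel = z > 1.96 * sigma                 # keep only "significant" positive results

print("bias over all tests:    %+.3f" % np.mean(z - theta))            # ~0
print("bias among selected:    %+.3f" % np.mean(z[sel] - theta[sel]))  # > 0

# Empirical-Bayes shrinkage under the normal-normal model: the posterior
# mean is z * tau^2 / (tau^2 + sigma^2), with tau^2 estimated from the
# marginal variance of z via Var(z) = tau^2 + sigma^2.
tau2_hat = max(np.var(z) - sigma**2, 0.0)
shrunk = z * tau2_hat / (tau2_hat + sigma**2)

print("shrunk, among selected: %+.3f" % np.mean(shrunk[sel] - theta[sel]))  # ~0
```

Because the selection event is a function of the observed `z`, the posterior mean remains conditionally unbiased under this model even after selection, which is why the shrunk estimates recover near-zero bias on the selected subset.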
