On Post-selection Inference in A/B Testing

When interpreting A/B tests, we typically focus only on the statistically significant results and take them at face value. This practice, termed post-selection inference in the statistical literature, may negatively affect both point estimation and uncertainty quantification, and therefore hinder trustworthy decision making in A/B testing. To address this issue, in this paper we explore two seemingly unrelated paths, one based on supervised machine learning and the other on empirical Bayes, and propose post-selection inferential approaches that combine the strengths of both. Through large-scale simulated and empirical examples, we demonstrate that our proposed methodologies outperform existing ones in both reducing post-selection bias and improving confidence interval coverage rates, and we discuss how they can be conveniently adapted to real-life scenarios.
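To make the post-selection problem concrete, the following is a minimal simulation sketch, not the method proposed in the paper. It assumes a simple normal-normal model (true effects drawn from N(0, tau^2), observed effects equal to the true effect plus N(0, se^2) noise) and a textbook empirical Bayes shrinkage estimator whose prior variance is estimated by the method of moments. The function name `simulate` and all parameter values are hypothetical choices for illustration.

```python
import numpy as np


def simulate(n_tests=100_000, tau=1.0, se=1.0, seed=0):
    """Compare naive and shrunken estimates on significant A/B tests.

    Assumed model (illustrative only): true_effect ~ N(0, tau^2),
    observed = true_effect + N(0, se^2).
    """
    rng = np.random.default_rng(seed)
    true_effect = rng.normal(0.0, tau, n_tests)
    observed = true_effect + rng.normal(0.0, se, n_tests)

    # Keep only the tests that look like significant positive wins.
    selected = observed / se > 1.96

    # Method of moments: Var(observed) = tau^2 + se^2.
    tau2_hat = max(observed.var() - se**2, 0.0)
    # Normal-normal posterior mean shrinks each estimate toward zero.
    shrink = tau2_hat / (tau2_hat + se**2)
    eb_estimate = shrink * observed

    return {
        "n_selected": int(selected.sum()),
        # Average overestimation when significant results are taken at face value.
        "naive_bias": float((observed[selected] - true_effect[selected]).mean()),
        # Average error of the shrunken estimates on the same selected tests.
        "eb_bias": float((eb_estimate[selected] - true_effect[selected]).mean()),
    }


if __name__ == "__main__":
    print(simulate())
```

Under these assumptions the naive estimates on the selected tests overstate the true effects on average, while the shrunken estimates are close to unbiased. That result depends on the prior being correctly specified, which real experiments do not guarantee.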

Alex Deng | Jiannan Lu | Vivek Ramamurthy | Yicheng Li
