Consistent Transformation of Ratio Metrics for Efficient Online Controlled Experiments

We study ratio overall evaluation criteria (user behavior quality metrics) and, in particular, average values of non-user-level metrics, which are widely used in A/B testing as an important part of modern Internet companies' evaluation instruments (e.g., abandonment rate, a user's absence time after a session). We focus on improving the sensitivity of these criteria, since there is a large gap between the variety of sensitivity improvement techniques designed for user-level metrics and the variety of such techniques for ratio criteria. We propose a novel transformation of a ratio criterion into the average value of a user-level (more generally, randomization-unit-level) metric, which makes it possible to directly apply the wide range of sensitivity improvement techniques designed for the user level and thus make A/B tests more efficient. We provide theoretical guarantees on the consistency of the novel metric in terms of preservation of two crucial properties (directionality and significance level) with respect to the source ratio criterion. An experimental evaluation on hundreds of large-scale real A/B tests run at one of the most popular global search engines reinforces the theoretical results and demonstrates up to $+34\%$ sensitivity rate improvement achieved by the transformation combined with the best known regression adjustment.
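To illustrate how transforming a ratio criterion into a per-user metric opens the door to user-level analysis and variance reduction machinery, the sketch below linearizes a per-user ratio metric (clicks per query) into a per-user value of the form L_u = X_u - kappa * Y_u, with kappa taken as the ratio computed on the control group, and then applies a plain two-sample t-test. The functional form, the synthetic data, and the helper name `linearize` are illustrative assumptions, not the paper's exact construction.

```python
# A minimal sketch of a linearization-style transformation of a ratio metric
# into a per-user metric, assuming the form L_u = X_u - kappa * Y_u with
# kappa fixed from the control group. Illustration only, not the exact
# method proposed in the paper.

import numpy as np
from scipy import stats


def linearize(numerator, denominator, kappa):
    """Map per-user (numerator, denominator) pairs to a single per-user value."""
    return numerator - kappa * denominator


# Synthetic per-user data: numerator = clicks, denominator = queries.
rng = np.random.default_rng(0)
n = 10_000
queries_c = rng.poisson(10, n) + 1
queries_t = rng.poisson(10, n) + 1
clicks_c = rng.binomial(queries_c, 0.30)
clicks_t = rng.binomial(queries_t, 0.31)  # small simulated treatment effect

# kappa is the value of the ratio metric on the control group.
kappa = clicks_c.sum() / queries_c.sum()

# Per-user linearized metric in both groups.
lin_c = linearize(clicks_c, queries_c, kappa)
lin_t = linearize(clicks_t, queries_t, kappa)

# The transformed metric is an average of per-user values, so a standard
# two-sample t-test applies, and user-level sensitivity improvement
# techniques (e.g., regression adjustment on pre-experiment covariates)
# can be applied to lin_c / lin_t directly.
t_stat, p_value = stats.ttest_ind(lin_t, lin_c, equal_var=False)
print(f"ratio (control) = {kappa:.4f}, t = {t_stat:.2f}, p = {p_value:.4f}")
```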
