Online Evaluation for Effective Web Service Development

Development of most leading web services and software products today is guided by data-driven decisions based on online evaluation, which ensures a steady stream of updates that are strong in both quality and quantity. Large Internet companies run online evaluation on a day-to-day basis and at a large scale, and a growing number of smaller companies are adopting A/B testing in their development cycles. Web development across the board therefore depends strongly on the quality of experimentation platforms. In this tutorial, we overview state-of-the-art methods underlying the everyday evaluation pipelines at some of the leading Internet companies. Software engineers, designers, analysts, and service or product managers --- beginners, advanced specialists, and researchers alike --- can learn how to make web service development data-driven and how to do it effectively.
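To make the A/B testing workflow discussed in the tutorial concrete, below is a minimal, hypothetical sketch of the core statistical check behind a simple online controlled experiment: a two-proportion z-test comparing a conversion-style metric between a control and a treatment group. The function name, sample counts, and significance threshold are illustrative assumptions, not taken from the tutorial or from any of the platforms it describes.

# Minimal sketch (hypothetical, not from the tutorial): a two-proportion z-test,
# the basic significance check behind a simple conversion-rate A/B experiment.
from math import sqrt
from scipy.stats import norm

def ab_test_z(conversions_a, users_a, conversions_b, users_b, alpha=0.05):
    """Return (z statistic, two-sided p-value, significant?) for a two-proportion test."""
    p_a = conversions_a / users_a
    p_b = conversions_b / users_b
    # Pooled conversion rate under the null hypothesis of no treatment effect.
    p_pool = (conversions_a + conversions_b) / (users_a + users_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / users_a + 1 / users_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return z, p_value, p_value < alpha

# Hypothetical usage: control converts 1,050 of 50,000 users, treatment 1,180 of 50,000.
z, p, significant = ab_test_z(1050, 50_000, 1180, 50_000)
print(f"z = {z:.2f}, p = {p:.4f}, significant at 0.05: {significant}")

In practice, production experimentation platforms layer much more on top of this basic test (variance reduction, sequential testing, metric sensitivity analysis), which is exactly the material the tutorial surveys; the sketch only illustrates the elementary building block.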
