Online Evaluation for Effective Web Service Development

Development of most leading web services and software products today is guided by data-driven decisions based on online evaluation, which ensures a steady stream of updates that are strong in both quality and quantity. Large Internet companies run online evaluation on a day-to-day basis and at a large scale, and a growing number of smaller companies are adopting A/B testing in their development cycles. Web development across the board therefore depends strongly on the quality of experimentation platforms. In this tutorial, we overview state-of-the-art methods underlying the everyday evaluation pipelines at some of the leading Internet companies. Software engineers, designers, analysts, and service or product managers --- beginners, advanced specialists, and researchers alike --- can learn how to make web service development data-driven and how to do it effectively.
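To make the A/B testing workflow discussed in the tutorial concrete, below is a minimal, hypothetical sketch of the core statistical check behind a simple online controlled experiment: a two-proportion z-test comparing a conversion-style metric between a control and a treatment group. The function name, sample counts, and significance threshold are illustrative assumptions, not taken from the tutorial or from any of the platforms it describes.

# Minimal sketch (hypothetical, not from the tutorial): a two-proportion z-test,
# the basic significance check behind a simple conversion-rate A/B experiment.
from math import sqrt
from scipy.stats import norm

def ab_test_z(conversions_a, users_a, conversions_b, users_b, alpha=0.05):
    """Return (z statistic, two-sided p-value, significant?) for a two-proportion test."""
    p_a = conversions_a / users_a
    p_b = conversions_b / users_b
    # Pooled conversion rate under the null hypothesis of no treatment effect.
    p_pool = (conversions_a + conversions_b) / (users_a + users_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / users_a + 1 / users_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return z, p_value, p_value < alpha

# Hypothetical usage: control converts 1,050 of 50,000 users, treatment 1,180 of 50,000.
z, p, significant = ab_test_z(1050, 50_000, 1180, 50_000)
print(f"z = {z:.2f}, p = {p:.4f}, significant at 0.05: {significant}")

In practice, production experimentation platforms layer much more on top of this basic test (variance reduction, sequential testing, metric sensitivity analysis), which is exactly the material the tutorial surveys; the sketch only illustrates the elementary building block.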
