A/B Testing at Scale: Accelerating Software Innovation

The Internet provides developers of connected software, including web sites, applications, and devices, an unprecedented opportunity to accelerate innovation by evaluating ideas quickly and accurately using controlled experiments, also known as A/B tests. From front-end user-interface changes to backend algorithms, from search engines (e.g., Google, Bing, Yahoo!) to retailers (e.g., Amazon, eBay, Etsy) to social networking services (e.g., Facebook, LinkedIn, Twitter) to travel services (e.g., Expedia, Airbnb, Booking.com) to many startups, online controlled experiments are now used to make data-driven decisions at a wide range of companies. While the theory of a controlled experiment is simple, and dates back to Sir Ronald A. Fisher's experiments at the Rothamsted Agricultural Experimental Station in England in the 1920s, the deployment and evaluation of online controlled experiments at scale (hundreds of concurrently running experiments) across a variety of web sites, mobile apps, and desktop applications presents many pitfalls and new research challenges. In this tutorial we will give an introduction to A/B testing, share key lessons learned from scaling experimentation at Bing to thousands of experiments per year, present real examples, and outline promising directions for future work. The tutorial will go beyond applications of A/B testing in information retrieval and will also discuss practical and research challenges arising in experimentation on web sites and mobile and desktop apps. Our goal in this tutorial is to teach attendees how to scale experimentation for their teams, products, and companies, leading to better data-driven decisions. We also want to inspire more academic research in the relatively new and rapidly evolving field of online controlled experimentation.
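To make the core idea concrete, the sketch below shows the basic statistical analysis behind a single A/B test metric: users are randomly split between a control (A) and a treatment (B), and the difference in a per-user metric is tested for statistical significance. This is a minimal illustrative example, not material from the tutorial; the metric, sample sizes, and conversion rates are hypothetical, and it assumes NumPy and SciPy are available.

```python
# Minimal sketch of evaluating one A/B test metric on simulated data.
# All rates and sample sizes below are hypothetical, chosen for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Per-user binary success metric (e.g., whether a session had a click).
control = rng.binomial(1, 0.100, size=50_000)    # variant A (baseline)
treatment = rng.binomial(1, 0.102, size=50_000)  # variant B (new feature)

# Welch's two-sample t-test on per-user means; with samples this large it is
# essentially equivalent to the usual z-test on proportions.
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

lift = treatment.mean() / control.mean() - 1
print(f"control rate:   {control.mean():.4f}")
print(f"treatment rate: {treatment.mean():.4f}")
print(f"relative lift:  {lift:+.2%}")
print(f"p-value:        {p_value:.4f}")  # compare against, e.g., alpha = 0.05
```

Running experimentation at scale is largely about doing this reliably and repeatedly: choosing trustworthy metrics, sizing experiments so small lifts are detectable, and guarding against the pitfalls the tutorial discusses.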
