Online Experimentation Diagnosis and Troubleshooting Beyond AA Validation

Online experiments are widely used at internet companies to evaluate the impact of new designs, features, or code changes on user behavior. Although the experimental design is straightforward in theory, in practice many problems can complicate the interpretation of results and invalidate conclusions about changes in user behavior. Many of these problems are difficult to detect and often go unnoticed. Acknowledging and diagnosing these issues can prevent experiment owners from making decisions based on fundamentally flawed data. When conducting online experiments, data quality assurance is a top priority before attributing any observed impact to the treatment. While some problems can be detected by running AA tests before introducing the treatment, many do not emerge during the AA period and appear only during the AB period; prior work on this topic has not addressed troubleshooting during the AB period. In this paper, we present lessons learned from experiments on a variety of internet consumer products at Yahoo, along with diagnostic and remedy procedures. Most of the examples and troubleshooting procedures presented here generalize to online experimentation at other companies. Some, such as traffic splitting and outlier problems, have been documented before, but others have not previously been described in the literature.
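One of the traffic splitting problems the abstract alludes to is a sample ratio mismatch: the observed control/treatment split deviating from the configured split, which silently biases every downstream metric. A minimal sketch of such a diagnostic is below; the function name, the normal approximation to the binomial, and the alpha threshold are illustrative choices, not a procedure taken from the paper.

```python
import math

def srm_check(control_users, treatment_users, expected_ratio=0.5, alpha=0.001):
    """Sample-ratio-mismatch check: test whether the observed traffic
    split deviates from the configured split more than chance allows.

    Uses a two-sided z-test (normal approximation to the binomial)
    under the null hypothesis that each user lands in control with
    probability `expected_ratio`.
    """
    n = control_users + treatment_users
    observed = control_users / n
    se = math.sqrt(expected_ratio * (1 - expected_ratio) / n)
    z = (observed - expected_ratio) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return p_value, p_value < alpha

# A healthy 50/50 split: small deviation, no alarm.
p_ok, mismatch_ok = srm_check(500_400, 499_600)

# A broken split: treatment silently losing ~2% of its users.
p_bad, mismatch_bad = srm_check(500_000, 490_000)
```

A conservative alpha (0.001 here) is deliberate: with millions of users even tiny logging losses produce extreme z-scores, so a flagged mismatch almost always indicates a real pipeline or bucketing bug rather than noise.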
