Demystifying dark matter for online experimentation

The rise of online controlled experimentation, a.k.a. A/B testing, began around the turn of the millennium with the emergence of internet giants such as Amazon, Bing, Facebook, Google, LinkedIn, and Yahoo. Good experimental design includes planning the sample size, confidence level, metrics to be measured, and test duration; these factors generally determine the quality and validity of an experiment. In practice, additional factors can also affect validity. One critical factor is the discrepancy between the planned bucket size and the actual bucket size, a hidden gap we call "Experimentation Dark Matter". Experimentation dark matter is invisible to A/A or A/B validation of experimental analysis, yet it can undermine the validity of an experiment. In this paper, we demonstrate in detail how this gap can cause a loss of statistical power as well as a loss of representativeness and generalizability. We propose a framework to monitor experimentation dark matter that would otherwise go unnoticed in a balanced A/B test, and we discuss the remediation of a recent dark matter issue using this framework. The framework is scalable, low-latency, and applicable to similar online controlled experimentation systems.
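The core idea above, i.e. detecting a gap between the planned and the actual bucket sizes, can be illustrated with a minimal sketch. The function names, the dictionary-based interface, and the 1% alert threshold are illustrative assumptions for this example, not the paper's implementation:

```python
# Hypothetical sketch: "dark matter" as the gap between the planned bucket
# size and the number of units actually observed in each bucket.
# The 1% threshold below is an illustrative assumption, not from the paper.

def dark_matter_ratio(planned: int, actual: int) -> float:
    """Fraction of planned traffic missing from (or surplus in) a bucket."""
    return (planned - actual) / planned

def check_buckets(planned: dict, actual: dict, threshold: float = 0.01) -> dict:
    """Return buckets whose actual size deviates from plan by more than threshold."""
    alerts = {}
    for bucket, n_planned in planned.items():
        gap = dark_matter_ratio(n_planned, actual.get(bucket, 0))
        if abs(gap) > threshold:
            alerts[bucket] = gap
    return alerts

# Example: the treatment bucket silently lost 5% of its planned traffic,
# while the control bucket is within tolerance and raises no alert.
planned = {"control": 100_000, "treatment": 100_000}
actual = {"control": 99_950, "treatment": 95_000}
print(check_buckets(planned, actual))  # → {'treatment': 0.05}
```

Note that a per-bucket check like this can catch imbalances that a standard A/A comparison misses: if both buckets lose traffic at the same rate, the A/B split still looks balanced even though the surviving sample may no longer represent the intended population.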
