The Difference-of-Datasets framework: a statistical method to discover insights

In this paper, we motivate framing common data analysis and business intelligence problems as the problem of understanding the differences between two datasets. We call this framework the Difference-of-Datasets (DoD) framework. We propose a simple and effective method for finding the root causes of changes, i.e., answering "Why did the observed change happen?" or "What drove the observed change?". Our method is based on a hypothesis test for detecting a difference between the distributions of two samples, and is tailored to large-scale correlated binary data. We apply our method to several interesting scenarios and successfully obtain insights that point to the underlying causes of unexpected changes. While our method originates from concepts in A/B testing, it can be extended to many other areas of data science and business intelligence.
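This excerpt does not spell out the paper's exact test statistic, so the sketch below is illustrative rather than a reproduction of the proposed method. It implements one standard two-sample test for clustered (correlated) binary outcomes, the ratio-estimator "simple method" of Rao and Scott (1992): each cluster's residual contributes to the variance of the estimated proportion, which inflates the variance relative to treating observations as independent. The function names and toy data are hypothetical.

```python
import math

def clustered_prop_and_var(clusters):
    """Ratio-estimator proportion and its variance for clustered binary data.

    clusters: list of (successes, cluster_size) pairs, one per cluster.
    The variance is estimated from between-cluster residuals, which
    accounts for within-cluster correlation without estimating it directly.
    """
    k = len(clusters)
    total_y = sum(s for s, _ in clusters)   # total successes
    total_m = sum(n for _, n in clusters)   # total observations
    p = total_y / total_m
    # Sum of squared cluster-level residuals around the pooled proportion.
    resid = sum((s - p * n) ** 2 for s, n in clusters)
    var = (k / (k - 1)) * resid / total_m ** 2
    return p, var

def two_sample_clustered_test(clusters_a, clusters_b):
    """Two-sided z-test for a difference in proportions between two
    clustered binary samples (e.g. the two datasets being compared)."""
    pa, va = clustered_prop_and_var(clusters_a)
    pb, vb = clustered_prop_and_var(clusters_b)
    z = (pa - pb) / math.sqrt(va + vb)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value

# Example: three clusters of 10 binary outcomes per arm.
z, p = two_sample_clustered_test(
    [(5, 10), (6, 10), (4, 10)],   # e.g. treatment: (successes, size) per cluster
    [(2, 10), (3, 10), (1, 10)],   # e.g. control
)
```

Because the variance comes from cluster-level residuals, correlated users or sessions within the same cluster do not overstate the effective sample size, which is the failure mode of a naive two-proportion test on such data.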
