The Difference-of-Datasets framework: a statistical method to discover insights

In this paper, we motivate framing common data analysis and business intelligence problems as the problem of understanding the differences between two datasets. We call this framework the Difference-of-Datasets (DoD) framework. We propose a simple and effective method for finding the root causes of changes, i.e., answering "Why did the observed change happen?" or "What drove the observed change?". Our method is based on a hypothesis test for detecting a difference between the distributions of two samples, and is tailored to large-scale correlated binary data. We apply our method to several interesting scenarios and successfully obtain insights that point to the underlying causes of unexpected changes. While our method originates from concepts in A/B testing, it can be extended to many other areas of data science and business intelligence.
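This excerpt does not spell out the paper's exact test statistic, so the sketch below is illustrative rather than a reproduction of the proposed method. It implements one standard two-sample test for clustered (correlated) binary outcomes, the ratio-estimator "simple method" of Rao and Scott (1992): each cluster's residual contributes to the variance of the estimated proportion, which inflates the variance relative to treating observations as independent. The function names and toy data are hypothetical.

```python
import math

def clustered_prop_and_var(clusters):
    """Ratio-estimator proportion and its variance for clustered binary data.

    clusters: list of (successes, cluster_size) pairs, one per cluster.
    The variance is estimated from between-cluster residuals, which
    accounts for within-cluster correlation without estimating it directly.
    """
    k = len(clusters)
    total_y = sum(s for s, _ in clusters)   # total successes
    total_m = sum(n for _, n in clusters)   # total observations
    p = total_y / total_m
    # Sum of squared cluster-level residuals around the pooled proportion.
    resid = sum((s - p * n) ** 2 for s, n in clusters)
    var = (k / (k - 1)) * resid / total_m ** 2
    return p, var

def two_sample_clustered_test(clusters_a, clusters_b):
    """Two-sided z-test for a difference in proportions between two
    clustered binary samples (e.g. the two datasets being compared)."""
    pa, va = clustered_prop_and_var(clusters_a)
    pb, vb = clustered_prop_and_var(clusters_b)
    z = (pa - pb) / math.sqrt(va + vb)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value

# Example: three clusters of 10 binary outcomes per arm.
z, p = two_sample_clustered_test(
    [(5, 10), (6, 10), (4, 10)],   # e.g. treatment: (successes, size) per cluster
    [(2, 10), (3, 10), (1, 10)],   # e.g. control
)
```

Because the variance comes from cluster-level residuals, correlated users or sessions within the same cluster do not overstate the effective sample size, which is the failure mode of a naive two-proportion test on such data.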
