Leakage in data mining: Formulation, detection, and avoidance

Deemed “one of the top ten data mining mistakes”, leakage is the introduction of information about the data mining target that should not legitimately be available to mine from. In addition to our own industry experience with real-life projects, controversies around several recent major public data mining competitions, such as the INFORMS 2010 Data Mining Challenge and the IJCNN 2011 Social Network Challenge, are evidence that this issue is as relevant today as it has ever been. While acknowledging the importance and prevalence of leakage in both synthetic competitions and real-life data mining projects, the existing literature has largely left the idea unexplored. What little has been said turns out not to be broad enough to cover more complex, recently documented cases of leakage, such as those where the classical independently and identically distributed (i.i.d.) assumption is violated. In our new approach, these cases and others are explained by explicitly defining modeling goals and analyzing the broader framework of the data mining problem. The resulting definition enables us to derive a general methodology for dealing with the issue. We show that it is possible to avoid leakage with a simple, specific approach to data management followed by what we call a learn-predict separation, and we present several ways of detecting leakage when the modeler has no control over how the data have been collected. We also offer an alternative point of view on leakage based on causal graph modeling concepts.
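To make the learn-predict separation concrete, the following is a minimal sketch in Python/pandas. The timestamped transaction log, its column names (customer_id, timestamp, amount), and the cutoff date are illustrative assumptions, not details from the paper; the point is only that features offered to the learner are computed strictly from data observed before the prediction cutoff.

```python
# A minimal sketch of learn-predict separation (assumed example data,
# not from the paper): every row carries a timestamp, and feature
# construction is only allowed to see rows from before the cutoff.
import pandas as pd

def build_features(log: pd.DataFrame, cutoff: pd.Timestamp) -> pd.DataFrame:
    """Aggregate per-customer features using ONLY rows observed strictly
    before `cutoff`, so no post-cutoff information leaks into the
    training representation."""
    visible = log[log["timestamp"] < cutoff]  # the learn-predict separation
    return visible.groupby("customer_id")["amount"].agg(["count", "sum"])

# Hypothetical transaction log for illustration.
log = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "timestamp": pd.to_datetime(
        ["2011-01-05", "2011-03-20", "2011-02-11", "2011-04-02"]),
    "amount": [10.0, 25.0, 7.5, 40.0],
})

cutoff = pd.Timestamp("2011-03-01")    # everything at/after this is "the future"
X_train = build_features(log, cutoff)  # legitimate: pre-cutoff data only

# Targets (e.g., "did the customer churn after the cutoff?") would then be
# derived from the post-cutoff period only, never from rows that were
# available when X_train was built.
```

Under this discipline, a leaking feature such as "total purchases including the outcome window" simply cannot be constructed, because the rows that would produce it are outside the learner's view.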
