Comparing sets of patterns with the Jaccard index

The ability to extract knowledge from data has been the driving force of data mining since its inception, and of statistical modeling long before that. Actionable knowledge often takes the form of patterns, in which a set of antecedents is used to infer a consequent. In this paper we offer a solution to the problem of comparing different sets of patterns. Our solution allows comparisons between sets of patterns that were derived using different techniques (such as different classification algorithms) or from different samples of data (such as temporal data or data perturbed for privacy reasons). We propose using the Jaccard index to measure the similarity between sets of patterns by converting each pattern into a single element of its set. Our measure is designed for conceptual simplicity, computational simplicity, interpretability, and wide applicability. We compare the results of this measure to prediction accuracy in the context of a real-world data mining scenario.
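The core idea can be illustrated with a minimal sketch. It is not the paper's implementation: the pattern representation (a frozen set of antecedent strings paired with a consequent), the exact-match notion of equality between patterns, the helper names `make_pattern` and `jaccard_index`, and the toy rules are all assumptions made for illustration. The Jaccard index itself is the standard |A ∩ B| / |A ∪ B|.

```python
from typing import FrozenSet, Iterable, Set, Tuple

# Hypothetical representation: a pattern is (antecedents, consequent).
# Antecedents are stored as a frozenset so that their order is irrelevant
# and the whole pattern is hashable, i.e. usable as a single set element.
Pattern = Tuple[FrozenSet[str], str]


def make_pattern(antecedents: Iterable[str], consequent: str) -> Pattern:
    """Turn one rule into a single hashable element."""
    return (frozenset(antecedents), consequent)


def jaccard_index(a: Set[Pattern], b: Set[Pattern]) -> float:
    """Standard Jaccard index: size of intersection over size of union."""
    if not a and not b:
        # Convention assumed here: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(a | b)


# Toy pattern sets, e.g. from two classifiers or two data samples.
rules_a = {
    make_pattern(["age=young", "income=low"], "buys=no"),
    make_pattern(["age=old"], "buys=yes"),
    make_pattern(["student=yes"], "buys=yes"),
}
rules_b = {
    make_pattern(["income=low", "age=young"], "buys=no"),  # same rule, reordered
    make_pattern(["age=old"], "buys=yes"),
    make_pattern(["income=high"], "buys=yes"),
}

similarity = jaccard_index(rules_a, rules_b)
print(similarity)  # two shared patterns out of four distinct ones: 0.5
```

Exact matching is the simplest possible choice. Patterns over continuous attributes would likely need discretization or some tolerance before two of them can be treated as the same element, and this sketch does not attempt that.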
