Krimp: mining itemsets that compress

A major problem in pattern mining is the explosion in the number of results. Tight constraints reveal only common knowledge, while loose constraints return an unmanageable number of patterns, because large groups of patterns essentially describe the same set of transactions. In this paper we approach this problem using the MDL (minimum description length) principle: the best set of patterns is the set that compresses the database best. For this task we introduce the Krimp algorithm. Experimental evaluation shows that typically only hundreds of itemsets are returned, a dramatic reduction, of up to seven orders of magnitude, relative to the number of frequent itemsets. These selections, called code tables, are of high quality, as shown by compression ratios, swap randomisation, and the accuracy of the code table-based Krimp classifier, all obtained on a wide range of datasets. Further, we extensively evaluate the heuristic choices made in the design of the algorithm.
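
To make the idea of "the set of patterns that compresses the database best" concrete, the following is a minimal sketch, not the authors' implementation. It assumes a code table is an ordered list of itemsets that includes every single item, that each transaction is covered greedily by non-overlapping itemsets in that order, and that each itemset gets a code of length -log2 of its relative usage. The names `cover` and `encoded_size` are hypothetical, and the sketch measures only the encoded size of the data, ignoring the cost of describing the code table itself, which a full MDL score would also count.

```python
from collections import Counter
from math import log2


def cover(transaction, code_table):
    """Greedily cover a transaction with non-overlapping itemsets.

    Itemsets are tried in code table order (an assumed ordering);
    each itemset that fits in the still-uncovered items is used.
    """
    remaining = set(transaction)
    used = []
    for itemset in code_table:
        if itemset <= remaining:
            used.append(itemset)
            remaining -= itemset
        if not remaining:
            break
    if remaining:
        raise ValueError("code table must contain every single item")
    return used


def encoded_size(database, code_table):
    """Size in bits of the database encoded with the code table.

    Each itemset's code length is -log2(usage / total usage), so
    frequently used itemsets get short codes.
    """
    usage = Counter()
    for transaction in database:
        usage.update(cover(transaction, code_table))
    total = sum(usage.values())
    return sum(-u * log2(u / total) for u in usage.values())


# Toy database of four transactions over items a, b, c.
database = [frozenset("abc"), frozenset("abc"), frozenset("ab"), frozenset("c")]

# Code table with single items only, and one that also holds the pattern {a, b}.
singletons = [frozenset("a"), frozenset("b"), frozenset("c")]
with_pattern = [frozenset("ab")] + singletons

size_singletons = encoded_size(database, singletons)
size_with_pattern = encoded_size(database, with_pattern)
```

On this toy database, adding the itemset {a, b} shrinks the encoded data, which is the sense in which a pattern "compresses". A search procedure would keep a candidate only if the total description length, data plus code table, goes down.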
