Summarising Data by Clustering Items

For a book, the title and abstract provide a good first impression of what to expect from it. For a database, getting a first impression is not so straightforward. Low-order statistics provide only limited insight, while pattern mining quickly produces too much detail. In this paper we propose a middle ground, and introduce a parameter-free method for constructing high-quality summaries of binary data. Our method builds a summary by grouping items that correlate strongly, and uses the Minimum Description Length principle to identify the best grouping, without requiring a distance measure between items. Besides offering a practical overview of which attributes interact most strongly, these summaries are also easily queried surrogates for the data. Experiments show that our method produces high-quality summaries: correlated attributes are correctly grouped, and the supports of frequent itemsets are closely approximated.
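To make the idea concrete, the following is a minimal sketch of MDL-style attribute grouping, not the method of the paper itself. It assumes a simple greedy bottom-up merge and a crude, made-up model cost; the function names (`group_cost`, `greedy_grouping`, `estimate_support`) and the exact encoding are illustrative assumptions only.

```python
import math
from collections import Counter
from itertools import combinations


def group_cost(rows, group):
    """Bits needed to encode the columns in `group` (assumed two-part code)."""
    n = len(rows)
    # Count how often each joint value combination of the group occurs.
    counts = Counter(tuple(row[i] for i in group) for row in rows)
    # Data cost: n times the empirical entropy of the joint distribution.
    data_bits = -sum(c * math.log2(c / n) for c in counts.values())
    # Model cost (an assumption for illustration): per stored combination,
    # one bit per attribute value plus log2(n + 1) bits for its count.
    model_bits = len(counts) * (len(group) + math.log2(n + 1))
    return data_bits + model_bits


def greedy_grouping(rows):
    """Merge attribute groups while a merge shortens the total description."""
    groups = [(i,) for i in range(len(rows[0]))]
    while len(groups) > 1:
        best = None
        for a, b in combinations(groups, 2):
            merged = tuple(sorted(a + b))
            gain = group_cost(rows, a) + group_cost(rows, b) - group_cost(rows, merged)
            if gain > 0 and (best is None or gain > best[0]):
                best = (gain, a, b, merged)
        if best is None:
            break  # no merge compresses the data further
        _, a, b, merged = best
        groups = [g for g in groups if g not in (a, b)] + [merged]
    return sorted(groups)


def estimate_support(rows, groups, itemset):
    """Approximate an itemset's support, treating groups as independent.

    For brevity the per-group frequencies are recomputed from `rows`;
    a real summary would store each group's distribution instead.
    """
    n = len(rows)
    estimate = 1.0
    for g in groups:
        needed = [i for i in g if i in itemset]
        if needed:
            estimate *= sum(all(row[i] for i in needed) for row in rows) / n
    return estimate
```

On a toy dataset where the first two attributes are identical and the third is independent of them, such as `rows = [((i // 2) % 2, (i // 2) % 2, i % 2) for i in range(16)]`, this sketch groups attributes 0 and 1 together and leaves attribute 2 on its own. No distance measure between items is involved: a merge is accepted only when it shortens the total encoded length.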
