Paper Information - Summarising Data by Clustering Items

Summarising Data by Clustering Items

For a book, the title and abstract provide a good first impression of what to expect from it. For a database, getting a first impression is not so straightforward. While low-order statistics provide only limited insight, mining the data quickly provides too much detail. In this paper we propose a middle ground, and introduce a parameter-free method for constructing high-quality summaries for binary data. Our method builds a summary by grouping items that strongly correlate, and uses the Minimum Description Length principle to identify the best grouping, without requiring a distance measure between items. Besides offering a practical overview of which attributes interact most strongly, these summaries are also easily queried surrogates for the data. Experiments show that our method discovers high-quality results: correlated attributes are correctly grouped and the supports of frequent itemsets are closely approximated.
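The idea in the abstract, grouping correlated items and letting description length decide when to stop, can be illustrated with a minimal sketch. This is not the paper's algorithm or its exact encoding: the function names (`group_cost`, `greedy_grouping`, `estimate_support`), the crude model cost, and the toy data are all assumptions made for illustration. The sketch starts from singleton groups, greedily merges the pair of groups that saves the most bits, and stops when no merge shortens the total description. It then estimates the support of an itemset by treating groups as independent.

```python
from collections import Counter
from itertools import combinations
from math import log2


def group_cost(rows, group):
    """Bits to encode the columns in `group`, plus a crude code-table cost."""
    n = len(rows)
    counts = Counter(tuple(row[i] for i in group) for row in rows)
    # Data cost: optimal code lengths under the empirical distribution.
    data_bits = -sum(c * log2(c / n) for c in counts.values())
    # Model cost (an assumed simplification): each value combination that
    # occurs costs |group| bits for the pattern and log2(n) for its count.
    model_bits = len(counts) * (len(group) + log2(n))
    return data_bits + model_bits


def greedy_grouping(rows):
    """Merge groups of item indices while a merge reduces the total cost."""
    groups = [(i,) for i in range(len(rows[0]))]
    while len(groups) > 1:
        best_gain, best_pair = 0.0, None
        for a, b in combinations(groups, 2):
            gain = (group_cost(rows, a) + group_cost(rows, b)
                    - group_cost(rows, a + b))
            if gain > best_gain:
                best_gain, best_pair = gain, (a, b)
        if best_pair is None:  # no merge saves bits: stop
            break
        a, b = best_pair
        merged = tuple(sorted(a + b))
        groups = [g for g in groups if g not in (a, b)] + [merged]
    return sorted(groups)


def estimate_support(rows, groups, itemset):
    """Estimate the relative support of `itemset`, assuming groups are independent."""
    n = len(rows)
    estimate = 1.0
    for group in groups:
        wanted = [i for i in group if i in itemset]
        if wanted:
            # Fraction of rows in which all wanted items of this group are 1.
            estimate *= sum(all(row[i] for i in wanted) for row in rows) / n
    return estimate


# Toy data: items 0 and 1 always agree, item 2 is independent of both.
rows = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)] * 8
groups = greedy_grouping(rows)
print(groups)                                  # [(0, 1), (2,)]
print(estimate_support(rows, groups, {0, 2}))  # 0.25
```

For brevity the sketch rescans the rows when answering a query. A real summary would instead store each group's value distribution so that queries never touch the original data.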

Jilles Vreeken | Michael Mampaey
