Simultaneous Pattern and Data Clustering for Pattern Cluster Analysis

In data mining and knowledge discovery, pattern discovery extracts previously unknown regularities from data and is a useful tool for categorical data analysis. However, the number of discovered patterns is often overwhelming, which makes it difficult and time-consuming to 1) interpret the discovered patterns and 2) use them to further analyze the data set. To overcome these problems, this paper proposes a new method that clusters patterns and their associated data simultaneously. When patterns are clustered, the data containing them are clustered as well, and the relation between patterns and data is made explicit. This explicit relation allows the user, on the one hand, to further analyze each pattern cluster via its associated data cluster and, on the other hand, to interpret why a data cluster is formed via its corresponding pattern cluster. Because the effectiveness of clustering depends mainly on the distance measure, several distance measures between patterns and their associated data are proposed, and their relationships to existing common measures are discussed. Once pattern clusters and their associated data clusters are obtained, each can be further analyzed individually. Experimental results on synthetic and real data are reported to evaluate the effectiveness of the proposed approach.
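
The idea of clustering patterns together with the data that contain them can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not specify the proposed distance measures, so the sketch assumes a plain Jaccard distance between the sets of rows matched by two patterns, and a simple greedy agglomerative merge with a hypothetical `max_distance` threshold. All function names and the toy data are illustrative assumptions.

```python
from itertools import combinations


def matched_rows(pattern, rows):
    """Indices of rows containing every (attribute, value) pair of the pattern."""
    return frozenset(
        i for i, row in enumerate(rows)
        if all(row.get(attr) == val for attr, val in pattern.items())
    )


def jaccard_distance(a, b):
    """Assumed distance: 1 - |a & b| / |a | b| over matched row indices."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_patterns(patterns, rows, max_distance=0.5):
    """Greedily merge the two closest clusters until none are close enough.

    Each cluster is a pair (pattern indices, row indices), so merging
    pattern clusters merges their associated data clusters at the same time.
    """
    clusters = [([i], set(matched_rows(p, rows))) for i, p in enumerate(patterns)]
    while len(clusters) > 1:
        (i, j), dist = min(
            (((i, j), jaccard_distance(clusters[i][1], clusters[j][1]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if dist > max_distance:
            break
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters


# Toy categorical data set and hand-written patterns (both hypothetical).
rows = [
    {"color": "red", "shape": "round", "size": "small"},
    {"color": "red", "shape": "round", "size": "large"},
    {"color": "red", "shape": "square", "size": "small"},
    {"color": "blue", "shape": "square", "size": "large"},
    {"color": "blue", "shape": "square", "size": "large"},
]
patterns = [
    {"color": "red", "shape": "round"},
    {"color": "red"},
    {"color": "blue", "size": "large"},
    {"shape": "square", "size": "large"},
]

for pattern_ids, row_ids in cluster_patterns(patterns, rows):
    print(sorted(pattern_ids), sorted(row_ids))
```

Each resulting cluster keeps its pattern indices next to the row indices they cover, which is the explicit pattern-to-data relation the abstract describes: a pattern cluster can be examined through its rows, and a group of rows can be explained by the patterns that produced it.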
