Computing Clusters of Correlation Connected objects

The detection of correlations between different features in a set of feature vectors is a very important data mining task because correlation indicates a dependency between the features or some association of cause and effect between them. This association can be arbitrarily complex, i.e. one or more features might be dependent from a combination of several other features. Well-known methods like the principal components analysis (PCA) can perfectly find correlations which are global, linear, not hidden in a set of noise vectors, and uniform, i.e. the same type of correlation is exhibited in all feature vectors. In many applications such as medical diagnosis, molecular biology, time sequences, or electronic commerce, however, correlations are not global since the dependency between features can be different in different subgroups of the set. In this paper, we propose a method called 4C (Computing Correlation Connected Clusters) to identify local subgroups of the data objects sharing a uniform but arbitrarily complex correlation. Our algorithm is based on a combination of PCA and density-based clustering (DBSCAN). Our method has a determinate result and is robust against noise. A broad comparative evaluation demonstrates the superior performance of 4C over competing methods such as DBSCAN, CLIQUE and ORCLUS.

[1]  G. Church,et al.  Systematic determination of genetic network architecture , 1999, Nature Genetics.

[2]  Hans-Peter Kriegel,et al.  A distribution-based clustering algorithm for mining in large spatial databases , 1998, Proceedings 14th International Conference on Data Engineering.

[3]  Hans-Peter Kriegel,et al.  The X-tree : An Index Structure for High-Dimensional Data , 2001, VLDB.

[4]  C. Ball,et al.  Saccharomyces Genome Database. , 2002, Methods in enzymology.

[5]  Daniel A. Keim,et al.  An Efficient Approach to Clustering in Large Multimedia Databases with Noise , 1998, KDD.

[6]  Philip S. Yu,et al.  MaPle: a fast algorithm for maximal pattern-based clustering , 2003, Third IEEE International Conference on Data Mining.

[7]  Dimitrios Gunopulos,et al.  Automatic subspace clustering of high dimensional data for data mining applications , 1998, SIGMOD '98.

[8]  Ping Chen,et al.  Using the fractal dimension to cluster datasets , 2000, KDD '00.

[9]  Hans-Peter Kriegel,et al.  OPTICS: ordering points to identify the clustering structure , 1999, SIGMOD '99.

[10]  Bernhard Liebl,et al.  Very high compliance in an expanded MS-MS-based newborn screening program despite written parental consent. , 2002, Preventive medicine.

[11]  Sharad Mehrotra,et al.  Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces , 2000, VLDB.

[12]  Philip S. Yu,et al.  Finding generalized projected clusters in high dimensional spaces , 2000, SIGMOD 2000.

[13]  Wei Wang,et al.  OP-cluster: clustering by tendency in high dimensional space , 2003, Third IEEE International Conference on Data Mining.

[14]  Philip S. Yu,et al.  Finding generalized projected clusters in high dimensional spaces , 2000, SIGMOD '00.

[15]  Christos Faloutsos,et al.  How to Use the Fractal Dimension to Find Correlations between Attributes , 2002 .

[16]  T. M. Murali,et al.  A Monte Carlo algorithm for fast projective clustering , 2002, SIGMOD '02.

[17]  Philip S. Yu,et al.  Fast algorithms for projected clustering , 1999, SIGMOD '99.

[18]  Christian Böhm,et al.  Independent quantization: an index compression technique for high-dimensional data spaces , 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).

[19]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[20]  Philip S. Yu,et al.  Clustering by pattern similarity in large data sets , 2002, SIGMOD '02.

[21]  Philip S. Yu,et al.  /spl delta/-clusters: capturing subspace correlation in a large data set , 2002, Proceedings 18th International Conference on Data Engineering.

[22]  Hans-Peter Kriegel,et al.  Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications , 1998, Data Mining and Knowledge Discovery.

[23]  Yi Zhang,et al.  Entropy-based subspace clustering for mining numerical data , 1999, KDD '99.