Cluster cores-based clustering for high dimensional data

We propose a new approach to clustering high-dimensional data that is based on a novel notion of cluster cores rather than on nearest neighbors. A cluster core is a fairly dense group containing a maximal number of pairwise similar objects. It represents the core of a cluster because all objects in the cluster are strongly attracted to it. As a result, building clusters from cluster cores achieves high accuracy. Other major characteristics of the approach are: (1) it uses a semantics-based similarity measure; (2) it does not incur the curse of dimensionality and scales linearly with the dimensionality of the data; (3) it outperforms the well-known clustering algorithm ROCK, with both lower time complexity and higher accuracy.
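
To make the notion of a group of pairwise similar objects concrete, the following is a minimal sketch and not the method of the paper. The abstract does not specify the similarity measure or the search procedure, so the Jaccard similarity, the threshold `theta`, and the greedy search in the hypothetical function `greedy_core` are all assumptions chosen for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets of attribute values (assumed measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def greedy_core(objects, theta):
    """Greedily grow a group in which every pair has similarity >= theta.

    Returns the sorted indices of the largest group found. This is an
    illustrative stand-in for a cluster core, not the paper's algorithm.
    """
    n = len(objects)
    # similar[i][j] is True when objects i and j are distinct and similar.
    similar = [
        [i != j and jaccard(objects[i], objects[j]) >= theta for j in range(n)]
        for i in range(n)
    ]
    # Try well-connected candidates first (most similar neighbours).
    order = sorted(range(n), key=lambda k: -sum(similar[k]))
    best = []
    for seed in range(n):
        group = [seed]
        for cand in order:
            # Add a candidate only if it is similar to every current member.
            if cand != seed and all(similar[cand][m] for m in group):
                group.append(cand)
        if len(group) > len(best):
            best = sorted(group)
    return best


# Toy categorical records: the first three share most of their values.
objects = [
    {"a", "b", "c"},
    {"a", "b", "d"},
    {"a", "b", "c", "d"},
    {"x", "y"},
    {"x", "y", "z"},
]
core = greedy_core(objects, theta=0.5)
print(core)  # [0, 1, 2]
```

In this toy data, records 0 to 2 are pairwise similar at the assumed threshold and form the largest such group, while records 3 and 4 form a smaller one. A greedy search does not guarantee a group of maximum size in general.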
