Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data

Finding clusters in data, especially high dimensional data, is challenging when the clusters are of widely diering shapes, sizes, and densities, and when the data contains noise and outliers. We present a novel clustering technique that addresses these issues. Our algorithm rst nds the nearest neighbors of each data point and then redenes the similarity between pairs of points in terms of how many nearest neighbors the two points share. Using this denition of similarity, our algorithm identies core points and then builds clusters around the core points. The use of a shared nearest neighbor denition of similarity alleviates problems with varying densities and high dimensionality, while the use of core points handles problems with shape and size. While our algorithm can nd the \dense" clusters that other clustering algorithms nd, it also nds clusters that these approaches overlook, i.e., clusters of low or medium density which represent relatively uniform regions \surrounded" by non-uniform or higher density areas. We experimentally show that our algorithm performs better than traditional methods (e.g., K-means, DBSCAN, CURE) on a variety of data sets: KDD Cup '99 network intrusion data, NASA Earth science time series data, two-dimensional point sets, and documents. The run-time complexity of our technique is O(n 2 ) if the similarity matrix has to be constructed. However, we discuss a number of optimizations that allow the algorithm to handle large data sets ecien tly.

[1]  David G. Stork,et al.  Pattern Classification , 1973 .

[2]  R. Jarvis,et al.  ClusteringUsing a Similarity Measure Based on SharedNear Neighbors , 1973 .

[3]  Anil K. Jain,et al.  Algorithms for Clustering Data , 1988 .

[4]  Peter Willett,et al.  Comparison of Hierarchie Agglomerative Clustering Methods for Document Retrieval , 1989, Comput. J..

[5]  Ali S. Hadi,et al.  Finding Groups in Data: An Introduction to Chster Analysis , 1991 .

[6]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[7]  Daphne Koller,et al.  Hierarchically Classifying Documents Using Very Few Words , 1997, ICML.

[8]  Oren Etzioni,et al.  Fast and Intuitive Clustering of Web Documents , 1997, KDD.

[9]  Gerald Kowalski,et al.  Information Retrieval Systems: Theory and Implementation , 1997 .

[10]  Sudipto Guha,et al.  CURE: an efficient clustering algorithm for large databases , 1998, SIGMOD '98.

[11]  Daniel A. Keim,et al.  An Efficient Approach to Clustering in Large Multimedia Databases with Noise , 1998, KDD.

[12]  G. Taylor Impacts of the El Ni?o/Southern Oscillation on the Pacific Northwest , 1998 .

[13]  Vipin Kumar,et al.  Multilevel Algorithms for Multi-Constraint Graph Partitioning , 1998, Proceedings of the IEEE/ACM SC98 Conference.

[14]  George Karypis,et al.  C HAMELEON : A Hierarchical Clustering Algorithm Using Dynamic Modeling , 1999 .

[15]  Daniel A. Keim,et al.  Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering , 1999, VLDB.

[16]  Hans-Peter Kriegel,et al.  OPTICS: ordering points to identify the clustering structure , 1999, SIGMOD '99.

[17]  Sudipto Guha,et al.  ROCK: a robust clustering algorithm for categorical attributes , 1999, Proceedings 15th International Conference on Data Engineering (Cat. No.99CB36337).

[18]  Philip S. Yu,et al.  On the merits of building categorization systems by supervised clustering , 1999, KDD '99.

[19]  Chinatsu Aone,et al.  Fast and effective text mining using linear-time document clustering , 1999, KDD '99.

[20]  Vipin Kumar,et al.  Chameleon: Hierarchical Clustering Using Dynamic Modeling , 1999, Computer.

[21]  George Karypis,et al.  Fast supervised dimensionality reduction algorithm with applications to document categorization & retrieval , 2000, CIKM '00.

[22]  Andrew McCallum,et al.  Efficient clustering of high-dimensional data sets with application to reference matching , 2000, KDD '00.

[23]  George Karypis,et al.  A Comparison of Document Clustering Techniques , 2000 .

[24]  Hans-Peter Kriegel,et al.  LOF: identifying density-based local outliers , 2000, SIGMOD '00.

[25]  R. Mooney,et al.  Impact of Similarity Measures on Web-page Clustering , 2000 .

[26]  M. Steinbach,et al.  Clustering Earth Science Data: Goals, Issues and Results , 2001 .

[27]  Shigeo Abe DrEng Pattern Classification , 2001, Springer London.

[28]  George Karypis,et al.  Evaluation of hierarchical clustering algorithms for document datasets , 2002, CIKM '02.

[29]  Levent Ertoz,et al.  A New Shared Nearest Neighbor Clustering Algorithm and its Applications , 2002 .

[30]  M. Steinbach,et al.  Data Mining for the Discovery of Ocean Climate Indices , 2002 .

[31]  Vipin Kumar,et al.  Finding Topics in Collections of Documents: A Shared Nearest Neighbor Approach , 2003, Clustering and Information Retrieval.

[32]  Hans-Peter Kriegel,et al.  Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications , 1998, Data Mining and Knowledge Discovery.

[33]  Inderjit S. Dhillon,et al.  Concept Decompositions for Large Sparse Text Data Using Clustering , 2004, Machine Learning.

[34]  Yong Wang,et al.  A Comparison of Document Clustering Algorithms , 2005, PRIS.