Density-based Cluster Algorithms in Low-dimensional and High-dimensional Applications

Second International Workshop on Text-based Information Retrieval (TIR 05)

Cluster analysis is the art of detecting groups of similar objects in large data sets without having specified these groups by means of explicit features. Among the various cluster algorithms developed so far, the density-based algorithms rank among the most advanced and robust approaches. However, this paper shows that density-based cluster analysis embodies no principle with clearly defined algorithmic properties. We contrast the density-based cluster algorithms DBSCAN and MajorClust, which were developed with different clustering tasks in mind and whose strengths and weaknesses can be explained by the dimensionality of the data to be clustered. Our motivation for this analysis comes from the field of information retrieval, where cluster analysis plays a key role in solving the document categorization problem. The paper is organized as follows: Section 1 recapitulates the important principles of cluster algorithms, Section 2 discusses the density-based algorithms DBSCAN and MajorClust, and Section 3 illustrates the strengths and weaknesses of both algorithms on the basis of geometric data analysis and document categorization problems.
