Analysis of Clustering Algorithms for Web-Based Search

Automatic document categorization plays a key role in the development of future interfaces for Web-based search. Clustering algorithms are considered a technology capable of mastering this "ad-hoc" categorization task. This paper presents the results of a comprehensive analysis of clustering algorithms in connection with document categorization. The contributions relate to exemplar-based, hierarchical, and density-based clustering algorithms. In particular, we contrast ideal and real clustering settings and present runtime results based on efficient implementations of the investigated algorithms.
