Document Categorization with MAJORCLUST

This paper investigates the text categorization capabilities of two clustering algorithms: Fuzzy k-Medoid and MAJORCLUST. Beyond quantifying the categorization performance of both algorithms, our experimental setting also helps to answer questions specific to clustering problems, such as how to determine the number of clusters and how to evaluate cluster quality.
