Centroid-Based Document Classification: Analysis and Experimental Results

In this paper we present a simple linear-time centroid-based document classification algorithm that, despite its simplicity and robust performance, has not been extensively studied and analyzed. Our experiments show that this centroid-based classifier consistently and substantially outperforms other algorithms, such as Naive Bayesian, k-nearest-neighbors, and C4.5, on a wide range of datasets. Our analysis shows that the similarity measure used by the centroid-based scheme allows it to classify a new document based on how closely its behavior matches the behavior of the documents belonging to different classes. This matching allows it to dynamically adjust for classes with different densities and to account for dependencies between the terms in the different classes.
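The general idea of a centroid-based classifier can be illustrated with a minimal sketch. The abstract does not specify the document representation or the similarity measure, so the code below assumes sparse term-weight vectors (plain dictionaries) and cosine similarity; the function names `train_centroids` and `classify` are hypothetical and chosen only for this illustration. Training makes a single pass over the documents, which is consistent with the linear-time claim.

```python
import math
from collections import defaultdict


def normalize(vec):
    """Scale a sparse term-weight vector to unit length (assumed preprocessing)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def train_centroids(docs, labels):
    """Compute one centroid per class as the mean of its normalized documents."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for doc, label in zip(docs, labels):
        counts[label] += 1
        for term, weight in normalize(doc).items():
            sums[label][term] += weight
    return {
        label: {t: w / counts[label] for t, w in terms.items()}
        for label, terms in sums.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse vectors (assumed similarity measure)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(doc, centroids):
    """Assign the document to the class whose centroid is most similar."""
    return max(centroids, key=lambda label: cosine(doc, centroids[label]))


# Toy data, invented purely for illustration.
docs = [
    {"ball": 2, "goal": 1},
    {"ball": 1, "team": 1},
    {"vote": 2, "law": 1},
    {"law": 1, "court": 2},
]
labels = ["sports", "sports", "politics", "politics"]
centroids = train_centroids(docs, labels)
prediction = classify({"goal": 1, "team": 1}, centroids)
```

Because each class is summarized by a single vector, classifying a new document costs one similarity computation per class rather than one per training document, as would be the case for k-nearest-neighbors.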
