Inductive learning algorithms and representations for text categorization

1. ABSTRACT Text categorization – the assignment of natural language texts to one or more predefined categories based on their content – is an important component in many information organization and management tasks. We compare the effectiveness of five different automatic learning algorithms for text categorization in terms of learning speed, realtime classification speed, and classification accuracy. We also examine training set size, and alternative document representations. Very accurate text classifiers can be learned automatically from training examples. Linear Support Vector Machines (SVMs) are particularly promising because they are very accurate, quick to train, and quick to evaluate. 1.1

[1]  Irving John Good,et al.  The Estimation of Probabilities: An Essay on Modern Bayesian Methods , 1965 .

[2]  J. J. Rocchio,et al.  Relevance feedback in information retrieval , 1971 .

[3]  Michael McGill,et al.  Introduction to Modern Information Retrieval , 1983 .

[4]  Philip J. Hayes,et al.  CONSTRUE/TIS: A System for Content-Based Indexing of a Database of News Stories , 1990, IAAI.

[5]  Norbert Fuhr,et al.  AIR/X - A rule-based multistage indexing system for Iarge subject fields , 1991, RIAO.

[6]  Bernhard E. Boser,et al.  A training algorithm for optimal margin classifiers , 1992, COLT '92.

[7]  David D. Lewis,et al.  An evaluation of phrasal and clustered representations on a text categorization task , 1992, SIGIR '92.

[8]  Sholom M. Weiss,et al.  Automated learning of decision rules for text categorization , 1994, TOIS.

[9]  Yiming Yang,et al.  Expert network: effective and efficient learning from human decisions in text categorization and retrieval , 1994, SIGIR '94.

[10]  David D. Lewis,et al.  A comparison of two learning algorithms for text categorization , 1994 .

[11]  Yiming Yang,et al.  An example-based mapping method for text categorization and retrieval , 1994, TOIS.

[12]  Vladimir Vapnik,et al.  The Nature of Statistical Learning , 1995 .

[13]  Harris Drucker,et al.  Learning algorithms for classification: A comparison on handwritten digit recognition , 1995 .

[14]  Hinrich Schütze,et al.  A comparison of classifiers and document representations for the routing problem , 1995, SIGIR '95.

[15]  Andreas S. Weigend,et al.  A neural network approach to topic spotting , 1995 .

[16]  Mehran Sahami,et al.  Learning Limited Dependence Bayesian Classifiers , 1996, KDD.

[17]  James P. Callan,et al.  Training algorithms for linear text classifiers , 1996, SIGIR '96.

[18]  D LewisDavid,et al.  Natural language processing for information retrieval , 1996 .

[19]  Yoram Singer,et al.  Context-sensitive learning methods for text categorization , 1996, SIGIR '96.

[20]  Yoav Freund,et al.  Boosting the margin: A new explanation for the effectiveness of voting methods , 1997, ICML.

[21]  Federico Girosi,et al.  Training support vector machines: an application to face detection , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[22]  David Maxwell Chickering,et al.  A Bayesian Approach to Learning Bayesian Networks with Local Structure , 1997, UAI.

[23]  Yiming Yang,et al.  A Comparative Study on Feature Selection in Text Categorization , 1997, ICML.

[24]  Andrew McCallum,et al.  A comparison of event models for naive bayes text classification , 1998, AAAI 1998.

[25]  Thorsten Joachims,et al.  Text Categorization with Support Vector Machines: Learning with Many Relevant Features , 1998, ECML.

[26]  Susan T. Dumais,et al.  A Bayesian Approach to Filtering Junk E-Mail , 1998, AAAI 1998.

[27]  John Platt,et al.  Fast training of svms using sequential minimal optimization , 1998 .

[28]  William W. Cohen,et al.  Context-sensitive learning methods for text categorization , 1999, TOIS.

[29]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.