A comparison of two learning algorithms for text categorization

This paper examines the use of inductive learning to categorize natural language documents into predefined content categories. Categorization of text is of increasing importance in information retrieval and natural language processing systems. Previous research on automated text categorization has mixed machine learning and knowledge engineering methods, making it difficult to draw conclusions about the performance of particular methods. In this paper we present empirical results on the performance of a Bayesian classifier and a decision tree learning algorithm on two text categorization data sets. We find that both algorithms achieve reasonable performance and allow controlled tradeoffs between false positives and false negatives. The stepwise feature selection in the decision tree algorithm is particularly effective in dealing with the large feature sets common in text categorization. However, even this algorithm is aided by an initial prefiltering of features, confirming the results found by Almuallim and Dietterich on artificial data sets. We also demonstrate the impact of the time-varying nature of category definitions.
