Comparing feature sets for learning text categorization

This paper describes an experimental study of feature selection in statistical learning for text categorization. We used the χ² statistic to select the most distinguishing terms, term bigrams, and term trigrams as text features. We found that applying syntactic restrictions to the bigrams and trigrams as an additional selection method enhances precision. Combining terms, bigrams, and trigrams into one feature set substantially improves the categorization results. We evaluated two machine learning algorithms on the text categorization task: TiMBL (a memory-based classifier) and C5.0 (a decision tree learning algorithm).
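As a rough illustration of χ²-based selection over a combined feature set, the sketch below scores every term, bigram, and trigram against one category and keeps the top k. It is not the authors' implementation: the function names (`ngrams`, `features`, `chi_square`, `select_features`), the document-level 2×2 contingency formulation, and the toy data are assumptions made for this example, and the syntactic restrictions on bigrams and trigrams are not modelled.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def features(tokens, max_n=3):
    """Set of terms, bigrams, and trigrams occurring in one document."""
    feats = set()
    for n in range(1, max_n + 1):
        feats.update(ngrams(tokens, n))
    return feats


def chi_square(a, b, c, d):
    """Chi-square statistic of a 2x2 feature/category table.

    a: in-category documents containing the feature
    b: out-of-category documents containing the feature
    c: in-category documents without the feature
    d: out-of-category documents without the feature
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: feature or category never varies
    return n * (a * d - c * b) ** 2 / denom


def select_features(docs, labels, category, k):
    """Return the k features with the highest chi-square for `category`."""
    n_pos = sum(1 for y in labels if y == category)
    n_neg = len(labels) - n_pos
    in_cat, out_cat = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        # Document frequency: each feature counts once per document.
        (in_cat if y == category else out_cat).update(features(tokens))
    scores = {}
    for f in set(in_cat) | set(out_cat):
        a, b = in_cat[f], out_cat[f]
        scores[f] = chi_square(a, b, n_pos - a, n_neg - b)
    # Highest score first; ties broken by the feature tuple itself.
    return sorted(scores, key=lambda f: (-scores[f], f))[:k]


# Toy data, invented for illustration only.
docs = [
    ["stock", "market", "falls"],
    ["stock", "market", "rises"],
    ["football", "match", "won"],
    ["football", "team", "won"],
]
labels = ["fin", "fin", "sport", "sport"]
top = select_features(docs, labels, "fin", 5)
```

Note that χ² is symmetric: features that reliably signal the absence of a category (here `("football",)`) score as highly as those that signal its presence, so a real pipeline may want to filter by the direction of the association as well.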
