Paper Information - Feature Selection and Feature Extraction for Text Categorization

Feature Selection and Feature Extraction for Text Categorization

The effect of selecting varying numbers and kinds of features for use in predicting category membership was investigated on the Reuters and MUC-3 text categorization data sets. Good categorization performance was achieved using a statistical classifier and a proportional assignment strategy. The optimal feature set size for word-based indexing was found to be surprisingly low (10 to 15 features) despite the large training sets. The extraction of new text features by syntactic analysis and feature clustering was investigated on the Reuters data set. Syntactic indexing phrases, clusters of these phrases, and clusters of words were all found to provide less effective representations than individual words.
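The two ideas named in the abstract, choosing a small set of word features and assigning categories proportionally, can be illustrated with a short sketch. This is not the paper's implementation. The abstract does not state the ranking criterion, so mutual information between word presence and category membership is an assumption here. The function names, the `p` proportionality constant, and the toy documents are all hypothetical, and the scores passed to `proportional_assignment` stand in for the output of whatever statistical classifier is used.

```python
import math
from collections import Counter

def mutual_information(docs, labels, word):
    """Mutual information (bits) between presence of `word` and a binary label.

    `docs` is a list of sets of words; `labels` is a list of booleans.
    """
    n = len(docs)
    # Joint counts of (word present?, document in category?)
    joint = Counter((word in doc, label) for doc, label in zip(docs, labels))
    mi = 0.0
    for (present, label), n_joint in joint.items():
        n_present = sum(v for (p2, _), v in joint.items() if p2 == present)
        n_label = sum(v for (_, l2), v in joint.items() if l2 == label)
        mi += (n_joint / n) * math.log2(n_joint * n / (n_present * n_label))
    return mi

def select_features(docs, labels, k):
    """Return the k words with the highest mutual information (assumed criterion)."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab,
                    key=lambda w: mutual_information(docs, labels, w),
                    reverse=True)
    return ranked[:k]

def proportional_assignment(scores, train_rate, p=1.0):
    """Assign the category to the top-scoring documents.

    The number assigned is proportional to how often the category occurred
    in training (`train_rate`), scaled by a hypothetical constant `p`.
    """
    n_assign = round(p * train_rate * len(scores))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = set(order[:n_assign])
    return [i in chosen for i in range(len(scores))]

# Toy data (hypothetical): four documents, two of them in the category.
docs = [{"wheat", "export"}, {"wheat", "farm"},
        {"oil", "export"}, {"oil", "price"}]
labels = [True, True, False, False]

top_words = select_features(docs, labels, k=2)
decisions = proportional_assignment([0.9, 0.1, 0.8, 0.3], train_rate=0.5)
```

On the toy data, the words that perfectly separate the two groups rank first, while a word that appears equally often in both groups carries no information. Raising or lowering `p` changes how many documents receive the category, which is one way to trade recall against precision.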

David D. Lewis

[1] David D. Lewis, et al. Representation and Learning in Information Retrieval, 1991.

[2] Kenneth Ward Church. A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text, 1988, ANLP.

[3] Norbert Fuhr, et al. Models for retrieval with probabilistic indexing, 1989, Inf. Process. Manag.

[4] Richard W. Hamming, et al. Coding and Information Theory, 1980.

[5] Richard O. Duda, et al. Pattern classification and scene analysis, 1974, A Wiley-Interscience publication.

[6] David D. Lewis, et al. Data extraction as text categorization: an experiment with the MUC-3 corpus, 1991, MUC.

[7] W. Bruce Croft, et al. Term clustering of syntactic phrases, 1989, SIGIR '90.

[8] Philip J. Hayes, et al. CONSTRUE/TIS: A System for Content-Based Indexing of a Database of News Stories, 1990, IAAI.

[9] David D. Lewis, et al. Evaluating Text Categorization I, 1991, HLT.

[10] Norbert Fuhr, et al. AIR/X - A rule-based multistage indexing system for large subject fields, 1991, RIAO.