Relative term-frequency based feature selection for text categorization

Automatic feature selection methods such as document frequency, information gain, and mutual information are commonly applied in the preprocessing stage of text categorization in order to reduce the originally high feature dimension to a manageable level, while also reducing noise to improve precision. These methods generally assess a specific term by counting its occurrences within individual categories or across the entire corpus, where "occurring in a document" is simply defined as occurring at least once. A major drawback of this measure is that, within a single document, it counts a recurrent term the same as a rare term, even though the former is clearly more informative and should be less likely to be removed. In this paper we propose a possible approach to overcome this problem: adjusting the occurrence count according to the relative term frequency, thereby emphasizing terms that recur within each document. Although the approach can be applied to any feature selection method, we implemented it on several of them and observed notable improvements in performance.
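The core idea above, replacing the binary "occurs at least once" indicator with a weight derived from the term's relative frequency inside each document, can be sketched as follows. This is an illustrative instantiation, not the paper's exact formula: here each document contributes `tf(t, d) / max_tf(d)` to a term's weighted document frequency, so a term that recurs often in a document contributes close to 1, while a term that appears only once in a long document contributes a small fraction. The function name and weighting scheme are assumptions for illustration.

```python
from collections import Counter

def weighted_document_frequency(docs):
    """Weighted document frequency: each document contributes a weight
    proportional to the term's relative frequency in that document,
    instead of a flat count of 1.

    Illustrative sketch only; the exact adjustment proposed in the
    paper may differ.
    """
    df = Counter()
    for doc in docs:
        tf = Counter(doc)          # raw term frequencies in this document
        max_tf = max(tf.values())  # frequency of the most frequent term
        for term, count in tf.items():
            # A recurrent term (count near max_tf) contributes almost 1;
            # a term occurring once among many contributes a small fraction.
            df[term] += count / max_tf
    return df

docs = [
    "cat cat cat dog".split(),
    "cat fish fish".split(),
]
wdf = weighted_document_frequency(docs)
# "cat" dominates doc 1 and appears once in doc 2, so its weighted
# count exceeds that of "dog", which occurs once in a single document.
```

Plain document frequency would score "cat", "dog", and "fish" as 2, 1, and 1; the weighted variant instead separates "cat" (1.5) from "fish" (1.0) and "dog" (1/3), reflecting how prominently each term recurs within the documents that contain it.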