Proposing a New Term Weighting Scheme for Text Categorization

In text categorization, term weighting methods assign appropriate weights to the terms to improve the classification performance. In this study, we propose an effective term weighting scheme, i.e. tf.rf, and investigate several widely-used unsupervised and supervised term weighting methods on two popular data collections in combination with SVM and kNN algorithms. From our controlled experimental results, not all supervised term weighting methods have a consistent superiority over unsupervised term weighting methods. Specifically, the three supervised methods based on the information theory, i.e. tf.χ2, tf.ig and tf.or, perform rather poorly in all experiments. On the other hand, our proposed tf.rf achieves the best performance consistently and outperforms other methods substantially and significantly. The popularly-used tf.idf method has not shown a uniformly good performance with respect to different data corpora.

[1]  Yiming Yang,et al.  A Comparative Study on Feature Selection in Text Categorization , 1997, ICML.

[2]  Gerard Salton,et al.  Term-Weighting Approaches in Automatic Text Retrieval , 1988, Inf. Process. Manag..

[3]  Marko Grobelnik,et al.  Feature selection using linear classifier weights: interaction with classification models , 2004, SIGIR '04.

[4]  Fabrizio Sebastiani,et al.  Supervised term weighting for automated text categorization , 2003, SAC '03.

[5]  Martin F. Porter,et al.  An algorithm for suffix stripping , 1997, Program.

[6]  Shi Bing,et al.  Inductive learning algorithms and representations for text categorization , 2006 .

[7]  Thomas G. Dietterich Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms , 1998, Neural Computation.

[8]  Jörg Kindermann,et al.  Text Categorization with Support Vector Machines. How to Represent Texts in Input Space? , 2002, Machine Learning.

[9]  Tatsunori Mori,et al.  Information Gain Ratio as Term Weight: The case of Summarization of IR Results , 2002, COLING.

[10]  Thorsten Joachims,et al.  Text categorization with support vector machines , 1999 .

[11]  Thorsten Joachims,et al.  Text Categorization with Support Vector Machines: Learning with Many Relevant Features , 1998, ECML.

[12]  Gerard Salton,et al.  A comparison of search term weighting: term relevance vs. inverse document frequency , 1981, SIGIR '81.

[13]  Gerard Salton,et al.  A comparison of search term weighting: term relevance vs. inverse document frequency , 1981, SIGIR 1981.

[14]  Man Lan,et al.  A comparative study on term weighting schemes for text categorization , 2005, Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005..

[15]  Yiming Yang,et al.  A re-examination of text categorization methods , 1999, SIGIR '99.

[16]  Shiwei Tang,et al.  A Comparative Study on Feature Weight in Text Categorization , 2004, APWeb.