Explicit Use of Term Occurrence Probabilities for Term Weighting in Text Categorization

This paper analyzes the behavior of leading symmetric and asymmetric term weighting schemes in the context of text categorization. The analysis covers their weighting patterns in the two-dimensional term occurrence probability space and the dynamic ranges of the weights they generate. In addition, a recently proposed term selection scheme, multi-class odds ratio, is considered as a potential symmetric weighting scheme. Based on the findings, a novel symmetric weighting scheme derived as a function of term occurrence probabilities is proposed. Experiments conducted on the Reuters-21578 ModApte Top10, WebKB, 7-Sectors and CSTR 2009 datasets indicate that the proposed scheme outperforms other leading schemes in terms of macro-averaged and micro-averaged F1 scores.
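To make the notion of a two-dimensional term occurrence probability space concrete, the following minimal sketch estimates, for one term and one category, the pair (probability of the term occurring in documents of the category, probability of it occurring in all other documents) and then evaluates the classic two-class log odds ratio on that pair. It is an illustration only and not the scheme proposed in the paper: the function names, the add-one smoothing, the toy documents, and the use of the absolute value as one possible reading of "symmetric" are all assumptions made here.

```python
from math import log

def occurrence_probabilities(docs, labels, term, smoothing=1.0):
    """Estimate (p_pos, p_neg) for a term.

    docs   : list of sets of tokens (hypothetical toy representation)
    labels : list of bools, True if the document belongs to the category
    The add-one style smoothing is an assumption that keeps both
    probabilities strictly between 0 and 1.
    """
    pos = [d for d, y in zip(docs, labels) if y]
    neg = [d for d, y in zip(docs, labels) if not y]
    pos_hits = sum(term in d for d in pos)
    neg_hits = sum(term in d for d in neg)
    p_pos = (pos_hits + smoothing) / (len(pos) + 2 * smoothing)
    p_neg = (neg_hits + smoothing) / (len(neg) + 2 * smoothing)
    return p_pos, p_neg

def log_odds_ratio(p_pos, p_neg):
    """Classic two-class log odds ratio, a function of (p_pos, p_neg) only."""
    return log(p_pos * (1.0 - p_neg) / ((1.0 - p_pos) * p_neg))

# Toy corpus (invented for illustration): two documents in the category, two outside it.
docs = [{"wheat", "price"}, {"wheat", "export"}, {"bank", "rate"}, {"bank", "price"}]
labels = [True, True, False, False]

for term in ("wheat", "bank", "price"):
    p_pos, p_neg = occurrence_probabilities(docs, labels, term)
    signed = log_odds_ratio(p_pos, p_neg)
    # abs() treats terms indicative of either class alike (assumed reading of "symmetric").
    print(term, (p_pos, p_neg), round(signed, 3), round(abs(signed), 3))
```

In this toy corpus, "wheat" and "bank" sit at mirrored points of the probability space and receive log odds ratios of equal magnitude and opposite sign, while "price", which occurs equally often on both sides, receives zero.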
