Model-induced term-weighting schemes for text classification

The bag-of-words representation of text data is widely used for document classification. Recent literature has shown that properly weighting the term feature vector can improve classification performance significantly beyond the original term-frequency features. In this paper we explain why recent term-weighting strategies succeed and suggest modifications that may be better justified. We then propose novel term-weighting schemes induced from well-known probabilistic document models such as Naive Bayes and the multinomial term model. Interestingly, some intuition-based term-weighting schemes coincide exactly with the proposed derivations. We test our term-weighting schemes on large-scale text classification datasets and demonstrate improved prediction performance over existing approaches.
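To make the idea of a model-induced weight concrete, the following is a minimal sketch, not the paper's derivation. It assumes a two-class problem and a multinomial term model with Laplace smoothing, and it weights each term by the absolute log-ratio of its class-conditional probabilities. The function names (`nb_log_ratio_weights`, `weighted_vector`) and the smoothing parameter `alpha` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def nb_log_ratio_weights(docs, labels, alpha=1.0):
    """Return a weight per term: |log p(t | class 1) - log p(t | class 0)|.

    Assumption (not from the paper): class-conditional term probabilities come
    from a multinomial model with additive (Laplace) smoothing `alpha`.
    `docs` is a list of token lists; `labels` holds 1 or 0 per document.
    """
    vocab = sorted({t for tokens in docs for t in tokens})
    pos, neg = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        (pos if y == 1 else neg).update(tokens)

    # Smoothed normalizers for each class.
    pos_total = sum(pos.values()) + alpha * len(vocab)
    neg_total = sum(neg.values()) + alpha * len(vocab)

    weights = {}
    for t in vocab:
        p = (pos[t] + alpha) / pos_total
        q = (neg[t] + alpha) / neg_total
        weights[t] = abs(math.log(p / q))
    return weights


def weighted_vector(tokens, weights):
    """Scale raw term frequencies by the per-term weights (unseen terms get 0)."""
    tf = Counter(tokens)
    return {t: tf[t] * weights.get(t, 0.0) for t in tf}


if __name__ == "__main__":
    docs = [["good", "movie"], ["bad", "movie"]]
    labels = [1, 0]
    w = nb_log_ratio_weights(docs, labels)
    print(w)
    print(weighted_vector(["good", "good", "movie"], w))
```

In this toy example, a term that occurs equally often in both classes ("movie") receives zero weight, while class-specific terms ("good", "bad") are emphasized. The reweighted vectors could then be passed to a linear classifier in place of raw term frequencies.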
