Robust Algorithms for Combining Multiple Term Weighting Vectors for Document Classification

Term weighting is a widely used technique that assigns weights to term features to improve accuracy in document classification. Although several successful term weighting algorithms have been proposed, none of them performs consistently well across different data domains. In this paper we propose methods for combining multiple term weight vectors to yield a robust document classifier that performs consistently well on diverse datasets. Specifically, we suggest two approaches: i) learning a single weight vector that lies in the convex hull of the base vectors while minimizing the class prediction loss, and ii) a minimax classifier that achieves robustness with respect to the individual weight vectors by minimizing the loss of the worst-performing strategy among the base vectors. We provide efficient solution methods for both optimization problems. The effectiveness and robustness of the proposed approaches are demonstrated on several benchmark document datasets, where they significantly outperform existing term weighting methods.
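
To make the two combination strategies concrete, the following is a plausible formalization consistent with the abstract; the symbols (base term weight vectors w_1, ..., w_K, mixing coefficients \alpha on the probability simplex \Delta_K, classifier f_\theta, documents x_i with labels y_i, and loss \ell) are illustrative notation, not taken from the paper.

\[
\text{(i)}\quad \min_{\theta,\ \alpha \in \Delta_K}\ \sum_{i=1}^{n} \ell\Big( f_\theta\big( \big(\textstyle\sum_{k=1}^{K} \alpha_k w_k\big) \odot x_i \big),\ y_i \Big),
\qquad
\text{(ii)}\quad \min_{\theta}\ \max_{1 \le k \le K}\ \sum_{i=1}^{n} \ell\big( f_\theta( w_k \odot x_i ),\ y_i \big),
\]

where \odot denotes elementwise multiplication of a term weight vector with a document's term-frequency vector. Problem (i) searches the convex hull of the base vectors for a single combined weighting, while problem (ii) trains one classifier that minimizes the loss of its worst-performing base weighting.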
