Entropy-Based Term Weighting Schemes for Text Categorization in VSM

Term weighting schemes are widely used in information retrieval and text categorization models. In this paper, we first investigate the limitations of several state-of-the-art term weighting schemes in the context of text categorization tasks. Since category-specific terms are more useful for discriminating between categories, and such terms tend to have lower entropy with respect to those categories, we then explore the relationship between a term's discriminating power and its entropy over a set of categories. To this end, we propose two entropy-based term weighting schemes (tf.dc and tf.bdc) that measure the discriminating power of a term by its global distributional concentration across the categories of a corpus. To demonstrate the effectiveness of the proposed schemes, we compare them with seven state-of-the-art schemes on a long-text corpus and on a short-text corpus. Our experimental results show that the proposed schemes outperform the state-of-the-art schemes in text categorization tasks with KNN and SVM.
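The abstract does not give the exact definitions of tf.dc and tf.bdc, but the underlying idea of scoring a term by the entropy of its distribution over categories can be sketched as follows. This is a minimal illustration, not the authors' formulas: the function name `bdc_weight`, the normalization of term counts by category size, and the final `tf * bdc` document weighting are all assumptions made for the example.

```python
import math

def bdc_weight(term_freq_per_category, category_sizes):
    """Hypothetical sketch of an entropy-based distributional-concentration weight.

    term_freq_per_category: dict mapping category -> frequency of the term in that category
    category_sizes: dict mapping category -> total term occurrences in that category
    Returns a value in [0, 1]: higher for a term concentrated in few categories,
    0 for a term spread evenly over all categories.
    """
    # p(t|c): term frequency normalized by category size (balances unequal categories)
    p_t_given_c = {c: term_freq_per_category.get(c, 0) / size
                   for c, size in category_sizes.items()}
    total = sum(p_t_given_c.values())
    if total == 0.0:
        return 0.0
    # Normalize across categories so the conditional probabilities form a distribution
    probs = [p / total for p in p_t_given_c.values() if p > 0]
    # Entropy of the term's distribution over categories
    entropy = -sum(p * math.log(p) for p in probs)
    # Low entropy (category-specific term) -> weight close to 1
    return 1.0 - entropy / math.log(len(category_sizes))


# Example over three equal-size categories: a category-specific term gets a
# noticeably higher weight than a term occurring evenly in every category.
sizes = {"sports": 1000, "politics": 1000, "tech": 1000}
print(bdc_weight({"sports": 90, "politics": 5, "tech": 5}, sizes))   # high weight
print(bdc_weight({"sports": 30, "politics": 30, "tech": 30}, sizes)) # 0.0
```

In a vector space model, such a weight would multiply the term frequency of each term in a document (tf.bdc), analogous to how tf.idf combines term frequency with inverse document frequency.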
