A new term‐weighting scheme for text classification using the odds of positive and negative class probabilities

Text classification (TC) is a core technique for text mining and information retrieval, and it has been applied in many research and industrial areas. Term‐weighting schemes assign an appropriate weight to each term in order to achieve high TC performance. Although term weighting is an important module for TC, and TC has characteristics that differ from those of information retrieval, many term‐weighting schemes developed for information retrieval, such as term frequency–inverse document frequency (tf–idf), have been used in TC unchanged. The characteristic that most distinguishes TC from information retrieval is the availability of class information. This article proposes a new term‐weighting scheme that exploits class information through the positive and negative class distributions. The proposed scheme, log tf–TRR, consistently outperforms both other schemes that use class information and traditional schemes such as tf–idf.
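To make the contrast between class-blind and class-aware weighting concrete, the following is a minimal sketch in Python. The abstract does not give the definition of TRR, so `class_odds_weights` is only a hypothetical stand-in: it assumes a weight built from the ratio of smoothed term probabilities in the positive and negative classes. The function names, the smoothing parameter `alpha`, and the toy documents are illustrative assumptions, not the article's method or data.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Classic idf, log(N / df(t)). It never looks at class labels."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return {t: math.log(n / df[t]) for t in df}


def class_odds_weights(docs, labels, alpha=1.0):
    """Hypothetical class-aware weight (NOT the article's exact TRR).

    Assumed form: log2(2 + P(t | positive) / P(t | negative)),
    with add-alpha smoothing so that neither probability is zero.
    """
    pos, neg = Counter(), Counter()
    for doc, label in zip(docs, labels):
        (pos if label else neg).update(doc)
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + alpha * len(vocab)
    neg_total = sum(neg.values()) + alpha * len(vocab)
    weights = {}
    for t in vocab:
        p_pos = (pos[t] + alpha) / pos_total
        p_neg = (neg[t] + alpha) / neg_total
        weights[t] = math.log2(2 + p_pos / p_neg)
    return weights


def log_tf(doc):
    """Log-scaled term frequency for one tokenized document."""
    return {t: math.log2(1 + c) for t, c in Counter(doc).items()}


# Toy corpus (made up for illustration): label 1 = positive class.
docs = [
    ["goal", "match", "team"],
    ["goal", "team", "win"],
    ["stock", "market", "team"],
    ["stock", "price", "market"],
]
labels = [1, 1, 0, 0]

idf = idf_weights(docs)
odds = class_odds_weights(docs, labels)

# Final document vector for the first document: log tf times the class-aware weight.
vector = {t: tf * odds[t] for t, tf in log_tf(docs[0]).items()}
```

In this toy corpus, "goal" and "stock" each occur in two documents, so idf gives them the same weight. The class-aware weight separates them because "goal" appears only in positive documents and "stock" only in negative ones. This is the kind of information that is available in TC but not in ad hoc retrieval.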
