Class-indexing-based term weighting for automatic text classification

Most previous studies on term weighting for automatic text classification (ATC) emphasize document-indexing-based approaches and approaches based on four fundamental information elements. In this study, we introduce class-indexing-based term-weighting approaches and evaluate their effects in high-dimensional and comparatively low-dimensional vector spaces against TF.IDF and five other term-weighting approaches, which serve as baselines. First, we implement a class-indexing-based TF.IDF.ICF observational term-weighting approach that incorporates the inverse class frequency (ICF). In the experiments, we investigate the effects of TF.IDF.ICF on the Reuters-21578, 20 Newsgroups, and RCV1-v2 benchmark collections; the approach provides positive discrimination on rare terms in the vector space but is biased against frequent terms in the text classification (TC) task. We therefore revise the ICF function, introduce a new inverse class space density frequency (ICSδF), and derive the TF.IDF.ICSδF method, which provides positive discrimination on both infrequent and frequent terms. We present a detailed per-category evaluation of the term-weighting approaches on the three datasets. The experimental results show that the proposed class-indexing-based TF.IDF.ICSδF term-weighting approach is promising compared with the well-known baseline term-weighting approaches.
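To make the class-indexing idea concrete, the following is a minimal sketch of a TF.IDF.ICF-style weight, not the exact formulation used in this study. It assumes raw term counts for TF and smoothed logarithms of the form log(1 + N/df) for IDF and log(1 + C/cf) for ICF, where cf is the number of classes in which a term occurs; the function name `tf_idf_icf` and the toy documents are hypothetical. The density-based revision of ICF is not reproduced here because the abstract does not define it.

```python
import math
from collections import Counter


def tf_idf_icf(docs, labels):
    """Illustrative TF.IDF.ICF weights (assumed formulas, see lead-in).

    docs:   list of token lists, one per document
    labels: class label of each document
    Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)
    n_classes = len(set(labels))

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    # For each term, the set of classes in which it occurs at least once.
    classes_of_term = {}
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            doc_freq[term] += 1
            classes_of_term.setdefault(term, set()).add(label)

    weights = []
    for tokens in docs:
        tf = Counter(tokens)  # raw term frequency in this document
        weights.append({
            term: count
            * math.log(1 + n_docs / doc_freq[term])                  # IDF
            * math.log(1 + n_classes / len(classes_of_term[term]))   # ICF
            for term, count in tf.items()
        })
    return weights


# Toy corpus: "oil" occurs in one class only, "price" occurs in both.
docs = [
    ["oil", "price", "rise"],
    ["oil", "export"],
    ["match", "price"],
    ["match", "goal"],
]
labels = ["energy", "energy", "sport", "sport"]
weights = tf_idf_icf(docs, labels)
```

In this toy corpus, "oil" and "price" have the same document frequency, so plain TF.IDF would weight them equally in the first document; the ICF factor raises the weight of "oil" because it is confined to a single class.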
