t-Test feature selection approach based on term frequency for text categorization

Abstract: Feature selection techniques play an important role in text categorization (TC), especially for large-scale TC tasks. Many new and improved methods have been proposed, and most of them are based on document frequency, such as the well-known Chi-square statistic (χ²) and information gain (IG). These document-frequency-based methods, however, have two shortcomings: (1) they are unreliable for low-frequency terms, which tend to be filtered out because of their small weights; and (2) they only count whether a term occurs in a document and ignore its term frequency. In practice, a high-frequency term (excluding stop words) that occurs in only a few documents is often a good discriminator in real-life corpora. To address these drawbacks, this paper focuses on how to construct a feature selection function based on term frequency and proposes a new approach using Student's t-test. The t-test function measures the difference between the distribution of a term's frequency within a specific category and its distribution over the entire corpus. Extensive comparative experiments on two text corpora using three classifiers show that the proposed approach is comparable to state-of-the-art feature selection methods in terms of macro-F1 and micro-F1. In particular, on micro-F1 our method achieves slightly better performance than χ² and IG on Reuters with kNN and SVM classifiers.
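
The abstract does not reproduce the scoring formula, but the core idea of comparing a term's frequency distribution inside a category with its distribution over the whole corpus can be sketched as follows. The snippet below is a minimal Python illustration, not the paper's exact formulation: the pooled standard-error form, the small smoothing constant, and combining per-category scores by taking the maximum are all assumptions made for this sketch.

```python
import numpy as np

def t_test_term_scores(tf, labels):
    """Sketch of a term-frequency-based t-test feature score.

    tf:      (n_docs, n_terms) array of raw term frequencies per document.
    labels:  (n_docs,) array of category ids.

    For each term and category, the score compares the mean term frequency
    inside the category against the mean over the whole corpus, normalised
    by a pooled standard error (a two-sample-style t statistic).
    """
    tf = np.asarray(tf, dtype=float)
    labels = np.asarray(labels)
    n_docs, n_terms = tf.shape

    corpus_mean = tf.mean(axis=0)          # mean frequency of each term over all docs
    corpus_var = tf.var(axis=0, ddof=1)    # corpus-wide variance of each term

    scores = np.zeros(n_terms)
    for c in np.unique(labels):
        in_c = labels == c
        n_c = in_c.sum()
        if n_c < 2:
            continue                       # not enough documents to estimate a variance
        cat_mean = tf[in_c].mean(axis=0)
        cat_var = tf[in_c].var(axis=0, ddof=1)
        # standard error of the difference between category and corpus means
        se = np.sqrt(cat_var / n_c + corpus_var / n_docs) + 1e-12
        t_stat = np.abs(cat_mean - corpus_mean) / se
        scores = np.maximum(scores, t_stat)  # keep the strongest per-category signal
    return scores
```

For feature selection, terms would then be ranked by score and the top k retained, e.g. `np.argsort(scores)[::-1][:k]`.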
