t-Test feature selection approach based on term frequency for text categorization

Abstract: Feature selection techniques play an important role in text categorization (TC), especially for large-scale TC tasks. Many new and improved methods have been proposed, and most of them are based on document frequency, such as the well-known Chi-square statistic (χ²) and information gain (IG). These document-frequency-based methods, however, have two shortcomings: (1) they are unreliable for low-frequency terms, which tend to be filtered out because of their small weights; and (2) they only count whether a term occurs in a document and ignore its term frequency. In practice, a high-frequency term (excluding stop words) that occurs in only a few documents is often a good discriminator in real-life corpora. To address these drawbacks, this paper focuses on how to construct a feature selection function based on term frequency and proposes a new approach using Student's t-test. The t-test function measures the difference between the distribution of a term's frequency within a specific category and its distribution over the entire corpus. Extensive comparative experiments on two text corpora using three classifiers show that the proposed approach is comparable to state-of-the-art feature selection methods in terms of macro-F1 and micro-F1. In particular, on micro-F1 our method achieves slightly better performance than χ² and IG on Reuters with kNN and SVM classifiers.
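
The abstract does not reproduce the scoring formula, but the core idea of comparing a term's frequency distribution inside a category with its distribution over the whole corpus can be sketched as follows. The snippet below is a minimal Python illustration, not the paper's exact formulation: the pooled standard-error form, the small smoothing constant, and combining per-category scores by taking the maximum are all assumptions made for this sketch.

```python
import numpy as np

def t_test_term_scores(tf, labels):
    """Sketch of a term-frequency-based t-test feature score.

    tf:      (n_docs, n_terms) array of raw term frequencies per document.
    labels:  (n_docs,) array of category ids.

    For each term and category, the score compares the mean term frequency
    inside the category against the mean over the whole corpus, normalised
    by a pooled standard error (a two-sample-style t statistic).
    """
    tf = np.asarray(tf, dtype=float)
    labels = np.asarray(labels)
    n_docs, n_terms = tf.shape

    corpus_mean = tf.mean(axis=0)          # mean frequency of each term over all docs
    corpus_var = tf.var(axis=0, ddof=1)    # corpus-wide variance of each term

    scores = np.zeros(n_terms)
    for c in np.unique(labels):
        in_c = labels == c
        n_c = in_c.sum()
        if n_c < 2:
            continue                       # not enough documents to estimate a variance
        cat_mean = tf[in_c].mean(axis=0)
        cat_var = tf[in_c].var(axis=0, ddof=1)
        # standard error of the difference between category and corpus means
        se = np.sqrt(cat_var / n_c + corpus_var / n_docs) + 1e-12
        t_stat = np.abs(cat_mean - corpus_mean) / se
        scores = np.maximum(scores, t_stat)  # keep the strongest per-category signal
    return scores
```

For feature selection, terms would then be ranked by score and the top k retained, e.g. `np.argsort(scores)[::-1][:k]`.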
