Improving Text Classification Performance with Random Forests-Based Feature Selection

Feature selection (FS) is employed to make text classification (TC) more effective. Well-known FS metrics such as information gain (IG) and odds ratio (OR) rank terms without considering term interactions. Building classifiers with FS algorithms that account for term interactions can yield better performance, but their computational complexity is a concern; this has motivated two-stage algorithms such as information gain–principal component analysis (IG–PCA). Random forests-based feature selection (RFFS), which builds on Breiman's random forests, has demonstrated outstanding performance in capturing gene–gene interactions in bioinformatics, but its usefulness for TC is less explored. RFFS has few control parameters, is resistant to overfitting, and thus generalizes well to new data. Because accuracy can be estimated from the out-of-bag samples, it requires neither a separate test set nor conventional cross-validation. This paper investigates the working of RFFS for TC and compares its performance against IG, OR, and IG–PCA. We carry out experiments on four widely used text data sets using naive Bayes and support vector machines as classifiers. RFFS achieves higher macro-F1 values than the other FS algorithms in 73% of the experimental instances. We also analyze the performance of RFFS for TC in terms of its parameters and the class skews of the data sets, and obtain interesting results.
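The paper does not include an implementation, but the kind of pipeline it describes can be sketched as follows: rank terms either by a classical per-term metric (mutual information is used here as a stand-in for IG) or by random forest variable importance (RFFS), keep the top-ranked terms, and evaluate naive Bayes and SVM classifiers by macro-F1. This is a minimal sketch only; the data set (20 Newsgroups), the scikit-learn APIs, and all parameter values (vocabulary size, number of trees, number of selected terms) are illustrative assumptions, not the authors' experimental setup.

```python
# Hedged sketch of RFFS vs. IG-style term ranking for text classification.
# Dataset, library choices, and all parameter values are assumptions.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

# A standard benchmark corpus, used purely for illustration.
train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

# Bag-of-words term counts; the vocabulary cap is an assumed setting.
vectorizer = CountVectorizer(stop_words="english", max_features=20000)
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)
y_train, y_test = train.target, test.target

def top_k_by_ig(X, y, k):
    """Rank terms by mutual information (a proxy for IG) and keep the top k."""
    scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)
    return np.argsort(scores)[::-1][:k]

def top_k_by_rffs(X, y, k, n_trees=100):
    """Rank terms by random forest variable importance (RFFS) and keep the top k."""
    rf = RandomForestClassifier(n_estimators=n_trees, oob_score=True,
                                n_jobs=-1, random_state=0)
    rf.fit(X, y)
    # Out-of-bag estimate: no separate test set or cross-validation is needed
    # to get an accuracy figure for the forest itself.
    print(f"RF out-of-bag score: {rf.oob_score_:.3f}")
    return np.argsort(rf.feature_importances_)[::-1][:k]

def evaluate(selector, k=2000):
    """Train NB and SVM on the k selected terms and report macro-F1."""
    idx = selector(X_train, y_train, k)
    Xtr, Xte = X_train[:, idx], X_test[:, idx]
    for name, clf in [("NB", MultinomialNB()), ("SVM", LinearSVC())]:
        clf.fit(Xtr, y_train)
        macro_f1 = f1_score(y_test, clf.predict(Xte), average="macro")
        print(f"{selector.__name__} + {name}: macro-F1 = {macro_f1:.3f}")

evaluate(top_k_by_ig)
evaluate(top_k_by_rffs)
```

In the paper's actual experiments the comparison also includes odds ratio and the two-stage IG–PCA method on four text data sets; those are omitted here to keep the sketch short.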

[1] Leo Breiman et al. Bagging Predictors, 1996, Machine Learning.

[2] Huan Liu et al. Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution, 2003, ICML.

[3] Kashif Javed et al. A two-stage Markov blanket based feature selection algorithm for text classification, 2015, Neurocomputing.

[4] Wei-Yin Loh et al. Classification and regression trees, 2011, WIREs Data Mining Knowl. Discov.

[5] Dunja Mladenic et al. Feature Selection for Unbalanced Class Distribution and Naive Bayes, 1999, ICML.

[6] Jean-Michel Poggi et al. Variable selection using random forests, 2010, Pattern Recognit. Lett.

[7] George Forman et al. An Extensive Empirical Study of Feature Selection Metrics for Text Classification, 2003, J. Mach. Learn. Res.

[8] Harun Uguz et al. A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm, 2011, Knowl. Based Syst.

[9] Hakan Altinçay et al. A novel framework for termset selection and weighting in binary text classification, 2014, Eng. Appl. Artif. Intell.

[10] D. R. Cutler et al. Utah State University SelectedWorks, 2017.

[11] Ramón Díaz-Uriarte et al. Gene selection and classification of microarray data using random forest, 2006, BMC Bioinformatics.

[12] Charu C. Aggarwal et al. A Survey of Text Classification Algorithms, 2012, Mining Text Data.

[13] Masoud Nikravesh et al. Feature Extraction: Foundations and Applications (Studies in Fuzziness and Soft Computing), 2006.

[14] Fabrizio Sebastiani et al. Machine learning in automated text categorization, 2001, CSUR.

[15] Alexander Hapfelmeier et al. A new variable selection approach using Random Forests, 2013, Comput. Stat. Data Anal.

[16] R. E. Abdel-Aal et al. GMDH-based feature ranking and selection for improved classification of medical data, 2005, J. Biomed. Informatics.

[17] Achim Zeileis et al. Bias in random forest variable importance measures: Illustrations, sources and a solution, 2007, BMC Bioinformatics.

[18] Andy Liaw et al. Classification and Regression by randomForest, 2007.

[19] Houkuan Huang et al. Feature selection for text classification with Naïve Bayes, 2009, Expert Syst. Appl.

[20] José Ranilla et al. Scoring and selecting terms for text categorization, 2005, IEEE Intelligent Systems.

[21] Kashif Javed et al. Machine learning using Bernoulli mixture models: Clustering, rule extraction and dimensionality reduction, 2013, Neurocomputing.

[22] Hongfei Lin et al. A two-stage feature selection method for text categorization, 2010, Seventh International Conference on Fuzzy Systems and Knowledge Discovery.

[23] Yung-Seop Lee et al. Enriched random forests, 2008, Bioinformatics.

[24] Serkan Günal et al. A novel probabilistic feature selection method for text classification, 2012, Knowl. Based Syst.

[25] Sanmay Das et al. Filters, Wrappers and a Boosting-Based Hybrid for Feature Selection, 2001, ICML.

[26] Yiming Yang et al. A Comparative Study on Feature Selection in Text Categorization, 1997, ICML.

[27] Bernhard Schölkopf et al. Learning with Kernels, 2001.

[28] Hinrich Schütze et al. Introduction to Information Retrieval, 2008.

[29] Ana Margarida de Jesus et al. Improving Methods for Single-label Text Categorization, 2007.

[30] Isabelle Guyon et al. Multivariate Non-Linear Feature Selection with Kernel Methods, 2005.

[31] A. G. Heidema et al. A framework to identify physiological responses in microarray-based gene expression studies: selection and interpretation of biologically relevant genes, 2008, Physiological Genomics.

[32] Thorsten Joachims et al. Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms, 2002, The Kluwer International Series in Engineering and Computer Science.

[33] Mariza de Andrade et al. Identification of genes and haplotypes that predict rheumatoid arthritis using random forests, 2009, BMC Proceedings.

[34] Serkan Günal et al. Text classification using genetic algorithm oriented latent semantic features, 2014, Expert Syst. Appl.

[35] Leo Breiman et al. Random Forests, 2001, Machine Learning.

[36] M. F. Porter. An algorithm for suffix stripping, 1997.

[37] Geoff Holmes et al. Benchmarking Attribute Selection Techniques for Discrete Class Data Mining, 2003, IEEE Trans. Knowl. Data Eng.

[38] Kashif Javed et al. Impact of a metric of association between two variables on performance of filters for binary data, 2014, Neurocomputing.

[39] Masoud Makrehchi. Feature Ranking for Text Classifiers, 2007.

[40] Rohini K. Srihari et al. Feature selection for text categorization on imbalanced data, 2004, SIGKDD Explorations.

[41] Ron Kohavi et al. Wrappers for Feature Subset Selection, 1997, Artif. Intell.

[42] Kashif Javed et al. Feature Selection Based on Class-Dependent Densities for High-Dimensional Binary Data, 2012, IEEE Transactions on Knowledge and Data Engineering.

[43] Thorsten Joachims et al. Text Categorization with Support Vector Machines: Learning with Many Relevant Features, 1998, ECML.