The All Relevant Feature Selection using Random Forest

In this paper we examine the application of the random forest classifier to the all relevant feature selection problem. To this end we first examine two recently proposed all relevant feature selection algorithms, both of which are random forest wrappers, on a series of synthetic data sets of varying size. We show that reasonable prediction accuracy can be achieved and that the heuristic algorithms designed for the all relevant problem perform nearly as well as the reference ideal algorithm. We then apply one of the algorithms to four families of semi-synthetic data sets to assess how the properties of a particular data set influence the results of feature selection. Finally, we test the procedure on a well-known gene expression data set. The relevance of nearly all previously established important genes was confirmed, and the relevance of several new genes was discovered.
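
Below is a minimal sketch of the shadow-attribute heuristic that underlies random forest wrappers of this kind (e.g., the Boruta algorithm). It is not the paper's exact procedure: it assumes scikit-learn's RandomForestClassifier, and it replaces the full iterative statistical test with a simple majority-of-rounds rule. The function name shadow_importance_selection and its parameters are illustrative choices, not part of the original work.

```python
# Sketch of all relevant feature selection via shadow attributes:
# real features must outperform the best permuted (shadow) copy
# in a majority of random forest runs to be deemed relevant.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def shadow_importance_selection(X, y, n_estimators=300, n_rounds=15, seed=0):
    """Return indices of features judged relevant by the shadow heuristic."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    hits = np.zeros(n_features, dtype=int)
    for _ in range(n_rounds):
        # Shadow attributes: per-column permutations that destroy any
        # relation to y while preserving each marginal distribution.
        shadows = np.column_stack(
            [rng.permutation(X[:, j]) for j in range(n_features)]
        )
        rf = RandomForestClassifier(
            n_estimators=n_estimators,
            random_state=int(rng.integers(1 << 31)),
        )
        rf.fit(np.hstack([X, shadows]), y)
        real = rf.feature_importances_[:n_features]
        shadow_max = rf.feature_importances_[n_features:].max()
        hits += (real > shadow_max).astype(int)
    # Accept features that beat the best shadow in most rounds.
    return np.where(hits > n_rounds // 2)[0]

if __name__ == "__main__":
    # Synthetic data with the informative features placed first.
    X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                               n_redundant=2, shuffle=False, random_state=42)
    print("Selected feature indices:", shadow_importance_selection(X, y))
```

In contrast to minimal-optimal selection, the goal here is to keep every feature that carries any usable information about the class, which is why each real attribute is compared against the strongest purely random contrast rather than pruned for redundancy.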
