Inferring statistically significant features from random forests

Embedded feature selection can be performed by analyzing the variables used in a Random Forest. Such a multivariate selection accounts for interactions between variables but is not straightforward to interpret in a statistical sense. We propose a statistical procedure to measure variable importance that tests whether variables are significantly useful in combination with others in a forest. We show experimentally that this new importance index correctly identifies relevant variables. The top of the variable ranking is largely correlated with Breiman's importance index based on a permutation test. Our measure has the additional benefit of producing p-values from the forest voting process. Such p-values offer a natural way to decide which features are significantly relevant while controlling the false discovery rate. Practical experiments are conducted on synthetic and real data, including low- and high-dimensional datasets for binary and multi-class problems. Results show that the proposed technique is effective and outperforms recent alternatives, reducing the computational complexity of the selection process by an order of magnitude while maintaining similar performance.
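The abstract describes the approach only at a high level; the sketch below illustrates the general workflow it evokes (per-feature p-values from a permutation null for random-forest importance, followed by Benjamini-Hochberg false-discovery-rate control). This is not the authors' exact voting-based procedure: the label-permutation null, the `benjamini_hochberg` helper, and all parameter values are illustrative assumptions, and scikit-learn is assumed for the forest.

```python
# Minimal sketch (not the paper's exact method): p-values for random-forest
# feature importance via a label-permutation null, then Benjamini-Hochberg
# FDR control to decide which features are significantly relevant.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
observed = forest.feature_importances_  # mean decrease in impurity

# Null distribution: importances obtained after permuting the labels,
# which breaks any association between features and the target.
n_perm = 100
null = np.empty((n_perm, X.shape[1]))
for b in range(n_perm):
    y_perm = rng.permutation(y)
    null[b] = RandomForestClassifier(
        n_estimators=200, random_state=b).fit(X, y_perm).feature_importances_

# One-sided permutation p-value per feature (+1 smoothing avoids p = 0).
pvals = (1 + (null >= observed).sum(axis=0)) / (n_perm + 1)

def benjamini_hochberg(p, alpha=0.05):
    """Step-up rule: reject the k smallest p-values, where k is the largest
    index with p_(k) <= (k / m) * alpha."""
    p = np.asarray(p)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * (np.arange(1, m + 1) / m)
    k = below.nonzero()[0].max() + 1 if below.any() else 0
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k]] = True
    return rejected

selected = np.flatnonzero(benjamini_hochberg(pvals))
print("features declared significant:", selected)
```

With this kind of procedure, the Benjamini-Hochberg correction bounds the expected proportion of false positives among the selected features at alpha, which is what the abstract means by deciding relevance "while controlling the false discovery rate".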
