Exploiting tree-based variable importances to selectively identify relevant variables

This paper proposes a novel statistical procedure based on permutation tests for extracting a subset of truly relevant variables from multivariate importance rankings derived from tree-based supervised learning methods. It shows also that the direct extension of the classical approach based on permutation tests for estimating false discovery rates of univariate variable scoring procedures does not extend very well to the case of multivariate tree-based importance measures.

[1]  David Heckerman,et al.  Determining the Number of Non-Spurious Arcs in a Learned DAG Model: Investigation of a Bayesian and a Frequentist Approach , 2007, UAI.

[2]  Yogendra P. Chaubey Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment , 1993 .

[3]  Pedro Larrañaga,et al.  A review of feature selection techniques in bioinformatics , 2007, Bioinform..

[4]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[5]  Leo Breiman,et al.  Classification and Regression Trees , 1984 .

[6]  Louis Wehenkel,et al.  Automatic Learning Techniques in Power Systems , 1997 .

[7]  Jason Weston,et al.  Gene Selection for Cancer Classification using Support Vector Machines , 2002, Machine Learning.

[8]  Wei Pan,et al.  Gene expression A note on using permutation-based false discovery rate estimates to compare different analysis methods for microarray data , 2005 .

[9]  Yongchao Ge Resampling-based Multiple Testing for Microarray Data Analysis , 2003 .

[10]  Pierre Geurts,et al.  Extremely randomized trees , 2006, Machine Learning.

[11]  Eugene Tuv,et al.  Feature Selection Using Ensemble Based Ranking Against Artificial Contrasts , 2006, The 2006 IEEE International Joint Conference on Neural Network Proceedings.

[12]  S. Dudoit,et al.  Microarray expression profiling identifies genes with altered expression in HDL-deficient mice. , 2000, Genome research.

[13]  Gérard Dreyfus,et al.  Ranking a Random Feature for Variable and Feature Selection , 2003, J. Mach. Learn. Res..

[14]  John D. Storey,et al.  Statistical significance for genomewide studies , 2003, Proceedings of the National Academy of Sciences of the United States of America.