Conditional variable importance in R package extendedForest

The gradientForest package was developed to analyse large numbers of potential predictor variables by integrating the individual results from random forest analyses over a number of species. The random forests for each species were produced by the R package extendedForest, which consists of modifications that we made to the original randomForest package [Liaw and Wiener, 2002]. One of the major modifications made to randomForest was to the method for calculating variable importance when two or more predictor variables are correlated.

Many of the predictor variables used in ecological studies are correlated either naturally (e.g., temperature decreasing with water depth) or functionally (e.g., benthic irradiance is calculated as a function of bottom depth and light attenuation). While some of these predictors may determine species distribution or abundance, other collinear predictors may not. Because only a random subset of predictors is considered at each node, a correlated but less influential predictor can stand in for a more influential predictor in the early splits of an individual tree, depending upon which predictors are selected in the subset. This tendency can be lessened by increasing the size of the predictor subset for each node, but the trade-off is an increase in correlation between trees in the forest, with a concurrent increase in generalization error and a decrease in accuracy [Breiman, 2001; see also Grömping, 2009].

Strobl et al. [2008] have also demonstrated that the permutation method for estimating variable importance exhibits a bias towards correlated predictor variables. The underlying reason for this behaviour has to do with the structure of the null hypothesis implied by the importance measure, i.e., independence between the response Y and the predictor Xj being permuted. A small value for the importance measure would suggest that Y and Xj are independent, but the measure also assumes that Xj is independent of the other predictor variables Z in the model that were not permuted (Z = X1, ..., Xj−1, Xj+1, ..., Xp). Correlation between Xj and Z will result in an
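
The point about the implied null hypothesis can be illustrated with a small sketch. It is not the extendedForest code. It only compares two ways of permuting Xj: across all rows, and within strata of a single correlated predictor Z, in the spirit of the conditional permutation described by Strobl et al. [2008]. The function names, the simulated data, and the use of ten quantile bins of Z as strata are all assumptions made for illustration.

```python
import random
import statistics


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def permute_marginal(xj, rng):
    """Shuffle xj across all rows (the unrestricted permutation scheme)."""
    shuffled = list(xj)
    rng.shuffle(shuffled)
    return shuffled


def permute_within_strata(xj, z, n_bins, rng):
    """Shuffle xj only among rows that share a quantile bin of z.

    Quantile bins of a single z are a hypothetical choice of strata.
    """
    n = len(z)
    order = sorted(range(n), key=lambda i: z[i])  # row indices sorted by z
    shuffled = list(xj)
    for b in range(n_bins):
        rows = order[b * n // n_bins:(b + 1) * n // n_bins]
        values = [xj[i] for i in rows]
        rng.shuffle(values)
        for i, v in zip(rows, values):
            shuffled[i] = v
    return shuffled


# Simulated data (assumption): xj is strongly correlated with z.
rng = random.Random(1)
n = 2000
z = [rng.gauss(0, 1) for _ in range(n)]
xj = [zi + 0.5 * rng.gauss(0, 1) for zi in z]

xj_marginal = permute_marginal(xj, rng)
xj_conditional = permute_within_strata(xj, z, 10, rng)

r_original = pearson(xj, z)
r_marginal = pearson(xj_marginal, z)
r_conditional = pearson(xj_conditional, z)

print(f"cor(Xj, Z) original:            {r_original:.2f}")
print(f"cor(Xj, Z) after marginal:      {r_marginal:.2f}")
print(f"cor(Xj, Z) after within-strata: {r_conditional:.2f}")
```

Under these assumptions, the unrestricted shuffle removes the association between Xj and Z along with the association between Xj and Y, whereas shuffling within strata of Z retains most of the Xj–Z correlation. A drop in accuracy after the restricted shuffle therefore reflects more specifically the link between Xj and Y given Z.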

Stephen J. Smith | Nick Ellis | C. Roland Pitcher

[1] Andy Liaw, et al. Classification and Regression by randomForest, 2002.

[2] Leo Breiman, et al. Random Forests, 2001, Machine Learning.

[3] U. Grömping. Dependence of Variable Importance in Random Forests on the Shape of the Regressor Space, 2009.

[4] Achim Zeileis, et al. Conditional variable importance for random forests, 2008, BMC Bioinformatics.