A Systematic Approach for Variable Selection With Random Forests: Achieving Stable Variable Importance Values

Random Forests variable importance measures are often used to rank variables by their relevance to a classification problem and subsequently to reduce the number of model inputs in high-dimensional data sets, thus increasing computational efficiency. However, because training data and predictor variables are randomly selected when constructing each tree and splitting each node, it is well known that variable importance rankings tend to differ between model runs if too few trees are generated. In this letter, we characterize the effect of the number of trees (ntree) and class separability on the stability of variable importance rankings and develop a systematic approach to define the number of model runs and/or trees required to achieve stability in variable importance measures. Results demonstrate that either a large ntree for a single model run or values averaged across multiple model runs with fewer trees is sufficient for achieving stable mean importance values. While the latter is far more computationally efficient, both methods tend to lead to the same ranking of variables. Moreover, the optimal number of model runs differs depending on the separability of classes. Recommendations are made to users regarding how to determine the number of model runs and/or trees required to achieve stable variable importance rankings.
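
The idea of averaging importance values over repeated model runs until the ranking stops changing can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure developed in the letter: the stopping rule (`patience` consecutive runs with an unchanged ranking), the function names, and the toy noise model standing in for a Random Forests run are all hypothetical. In practice `run_model` would fit a forest with a fixed ntree and return one importance value per variable.

```python
import random
from statistics import mean


def rank_variables(importances):
    """Return variable indices ordered from most to least important."""
    return sorted(range(len(importances)), key=lambda i: -importances[i])


def mean_importance(runs):
    """Average each variable's importance across runs (one list per run)."""
    return [mean(column) for column in zip(*runs)]


def runs_until_stable(run_model, max_runs=200, patience=10):
    """Add model runs one at a time until the ranking of the mean importances
    has stayed the same for `patience` consecutive runs (assumed stopping rule).

    Returns (number_of_runs, ranking); falls back to max_runs if never stable.
    """
    runs, last_ranking, unchanged = [], None, 0
    for n in range(1, max_runs + 1):
        runs.append(run_model())
        ranking = rank_variables(mean_importance(runs))
        unchanged = unchanged + 1 if ranking == last_ranking else 0
        last_ranking = ranking
        if unchanged >= patience:
            return n, ranking
    return max_runs, last_ranking


def make_toy_run(true_importance=(0.40, 0.30, 0.20, 0.10), noise=0.02, seed=0):
    """Hypothetical stand-in for one model run: fixed importances plus
    Gaussian noise that mimics run-to-run variation with few trees."""
    rng = random.Random(seed)

    def run():
        return [value + rng.gauss(0.0, noise) for value in true_importance]

    return run


n_runs, ranking = runs_until_stable(make_toy_run())
print(n_runs, ranking)
```

Raising `noise` or narrowing the gaps between the toy importance values imitates noisier runs or less separable variables, and generally increases the number of runs needed before the ranking settles.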
