Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution

Torsten Hothorn

Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types, for example when predictors include both sequence data and continuous variables such as folding energy, or when amino acid sequence data show different numbers of categories. Simulation studies are presented illustrating that, when random forest variable importance measures are used with data of varying types, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. Two mechanisms underlie this deficiency: biased variable selection in the individual classification trees used to build the random forest, and effects induced by bootstrap sampling with replacement. We propose to employ an alternative implementation of random forests that provides unbiased variable selection in the individual classification trees. When this method is applied using subsampling without replacement, the resulting variable importance measures can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories.
The usage of both random forest algorithms and their variable importance measures in the R system for statistical computing is illustrated and documented thoroughly in an application re-analysing data from a study on RNA editing. The suggested method can therefore be applied straightforwardly by scientists in bioinformatics research.
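
The first mechanism named above, biased variable selection in individual trees, can be illustrated with a small simulation. The following is a minimal sketch in Python using only the standard library; it is not the R code or the simulation design from the paper. It assumes a binary response that is independent of three categorical predictors, and the number of categories (2, 4, 8), the sample size, the number of replications, and the seed are all illustrative choices. For each replication it records which predictor offers the largest Gini gain over all of its binary splits, which is how a Gini-based tree would choose its root split.

```python
import random
from itertools import combinations


def gini(pos, n):
    """Gini impurity of a node with `pos` positives among `n` observations."""
    if n == 0:
        return 0.0
    p = pos / n
    return 2.0 * p * (1.0 - p)


def best_gini_gain(x, y, k):
    """Largest Gini gain over all binary splits of a k-category predictor.

    x holds category codes 0..k-1, y holds 0/1 class labels.
    """
    n = len(y)
    count = [0] * k  # observations per category
    pos = [0] * k    # positives per category
    for xi, yi in zip(x, y):
        count[xi] += 1
        pos[xi] += yi
    total_pos = sum(pos)
    parent = gini(total_pos, n)
    best = 0.0
    # Category 0 always stays in the right node, so every binary
    # partition of the categories is enumerated exactly once.
    for size in range(1, k):
        for left in combinations(range(1, k), size):
            n_left = sum(count[c] for c in left)
            n_right = n - n_left
            if n_left == 0 or n_right == 0:
                continue  # not a real split for this sample
            p_left = sum(pos[c] for c in left)
            p_right = total_pos - p_left
            child = (n_left * gini(p_left, n_left)
                     + n_right * gini(p_right, n_right)) / n
            best = max(best, parent - child)
    return best


def simulate(levels=(2, 4, 8), n=120, reps=300, seed=1):
    """Share of replications in which each uninformative predictor wins."""
    rng = random.Random(seed)
    wins = {k: 0 for k in levels}
    for _ in range(reps):
        # Response and predictors are drawn independently (null case).
        y = [rng.randint(0, 1) for _ in range(n)]
        gains = {}
        for k in levels:
            x = [rng.randrange(k) for _ in range(n)]
            gains[k] = best_gini_gain(x, y, k)
        wins[max(gains, key=gains.get)] += 1
    return {k: wins[k] / reps for k in levels}


selection_freq = simulate()
print(selection_freq)
```

Because none of the predictors carries any information, an unbiased selection rule would pick each of them about equally often. In this sketch the predictor with more categories has more candidate splits and therefore more chances to reach a large gain by chance alone, which is the kind of preference for suboptimal predictors that the abstract describes.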
