Variable importance scores

Scoring variables for their importance in predicting a response is an ill-defined concept. Several methods have been proposed, but little is known about their performance. This paper fills the gap with a comparative evaluation of eleven methods and an updated one based on the GUIDE algorithm. For data without missing values, eight of the methods are shown to be biased in that they give higher or lower scores to different types of variables even when all are independent of the response. Of the remaining four methods, only two are applicable to data with missing values, and GUIDE is the only unbiased one among them. GUIDE achieves unbiasedness by means of a self-calibrating step that is applicable to other methods for score de-biasing. GUIDE also yields a threshold for distinguishing important from unimportant variables at 95 and 99 percent confidence levels; this technique, too, is applicable to other methods. Finally, the paper studies the relationship of the scores to predictive power in three data sets and finds that the scores of many methods are more consistent with marginal predictive power than with conditional predictive power.
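
The abstract does not spell out the self-calibrating step, but its general logic can be illustrated with a generic null-calibration sketch: recompute the importance scores after permuting the response, so that every predictor is independent of it, and use a high quantile of the resulting null scores as the threshold for declaring a variable important. The Python sketch below is a minimal illustration of that idea under stated assumptions; the random-forest scorer is only a stand-in for any importance method, and the function names (importance_scores, null_threshold) and tuning choices are hypothetical, not taken from the paper.

```python
# Minimal sketch of null-calibration for importance scores (assumption-laden,
# not the paper's exact procedure): permute y to break all association with X,
# recompute the scores, and use a high quantile of the null scores as a cutoff.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def importance_scores(X, y, seed=0):
    # Stand-in scorer: any importance method could be substituted here.
    rf = RandomForestRegressor(n_estimators=200, random_state=seed)
    rf.fit(X, y)
    return rf.feature_importances_

def null_threshold(X, y, n_perm=100, level=0.95, seed=0):
    rng = np.random.default_rng(seed)
    null_max = np.empty(n_perm)
    for b in range(n_perm):
        y_perm = rng.permutation(y)                      # all X-y association removed
        null_max[b] = importance_scores(X, y_perm, seed=b).max()
    return np.quantile(null_max, level)                  # e.g. 0.95 or 0.99 level

# Example: only the first of five predictors is truly related to y.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = 2 * X[:, 0] + rng.normal(size=300)
print(np.round(importance_scores(X, y), 3), "threshold:", round(null_threshold(X, y), 3))
```

Because the null scores are produced by the same method being calibrated, any systematic preference for particular variable types inflates the null quantile as well, which is what makes this style of calibration a plausible device for de-biasing and thresholding scores.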
