A Bias Correction Algorithm for the Gini Variable Importance Measure in Classification Trees

This article considers a measure of variable importance frequently used in variable-selection methods based on decision trees and tree-based ensemble models, including CART, random forests, and gradient boosting machines. The importance of a covariate is defined as the total reduction in heterogeneity of the response variable that the covariate produces when the sample space is recursively partitioned. Despite the popularity of this measure, several authors have shown that it is biased and that, under certain conditions, the bias can seriously distort variable selection. Here we present a simple and effective method for bias correction, focusing on the Gini index as the measure of heterogeneity, a case that generalizes easily to other measures.
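To make the definition concrete, the following is a minimal sketch of the uncorrected measure only, not of the bias correction proposed in the article. It assumes binary splits and the common convention of weighting each node's Gini reduction by the fraction of the sample that reaches the node. The function names `gini`, `gini_reduction`, and `variable_importance` are hypothetical and chosen for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini heterogeneity index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_reduction(labels, left_mask):
    """Heterogeneity reduction of one binary split at a node.

    left_mask[i] is True when observation i goes to the left child.
    """
    left = [y for y, m in zip(labels, left_mask) if m]
    right = [y for y, m in zip(labels, left_mask) if not m]
    n = len(labels)
    return (
        gini(labels)
        - (len(left) / n) * gini(left)
        - (len(right) / n) * gini(right)
    )


def variable_importance(splits):
    """Sum weighted reductions per covariate over all splits in a tree.

    splits: iterable of (covariate_name, node_weight, reduction), where
    node_weight is the fraction of the sample reaching the node
    (an assumed weighting convention, not taken from the article).
    """
    totals = {}
    for name, weight, reduction in splits:
        totals[name] = totals.get(name, 0.0) + weight * reduction
    return totals


if __name__ == "__main__":
    y = [0, 0, 1, 1]
    # A split that separates the classes perfectly removes all heterogeneity.
    print(gini_reduction(y, [True, True, False, False]))
    # A split that leaves both children mixed removes none.
    print(gini_reduction(y, [True, False, True, False]))
```

For an ensemble, the per-tree totals would typically be averaged over trees. Because the best reduction is chosen among all candidate splits of a covariate, covariates with more candidate splits can achieve larger reductions by chance alone, which is one way the kind of bias described above can arise.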
