A Framework for Unbiased Model Selection Based on Boosting

Variable and model selection are of major concern in many statistical applications, especially in high-dimensional regression models. Boosting is a convenient statistical method that combines model fitting with intrinsic model selection. We investigate the impact of base-learner specification on the performance of boosting as a model selection procedure. We show that variable selection may be biased if the covariates are of different types. Important examples are models combining continuous and categorical covariates, especially if the number of categories is large. In this case, least squares base-learners offer increased flexibility for the categorical covariate and lead to its preferential selection even if it is noninformative. Similar difficulties arise when comparing linear and nonlinear base-learners for a continuous covariate: the additional flexibility of the nonlinear base-learner again yields a preference for the more complex modeling alternative. We investigate these problems from a theoretical perspective and suggest a framework for bias correction based on a general class of penalized least squares base-learners. Making all base-learners comparable in terms of their degrees of freedom strongly reduces the selection bias observed in naive boosting specifications. The importance of unbiased model selection is demonstrated in simulations. Supplemental materials, including an application to forest health models, additional simulation results, additional theorems, and proofs of the theorems, are available online.
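To make the proposed remedy concrete: a penalized least squares base-learner with design matrix X and penalty matrix K has hat matrix S(lambda) = X (X'X + lambda K)^{-1} X', and its flexibility can be measured by degrees of freedom such as df(lambda) = trace(S(lambda)). Choosing a separate lambda for each base-learner so that all base-learners attain one common, small df puts them on an equal footing before boosting selects among them. The following is a minimal Python sketch of this calibration step, not the authors' implementation; it assumes the common trace(S) definition of df, a ridge penalty (K = identity) for the categorical base-learner, and illustrative helper names (df_ridge, lambda_for_df).

```python
# Minimal sketch: equalizing degrees of freedom across base-learners
# before component-wise boosting, under the assumptions stated above.
import numpy as np

def df_ridge(X, K, lam):
    # Hat matrix of the penalized least squares fit:
    # S = X (X'X + lam * K)^{-1} X';  df(lam) = trace(S).
    S = X @ np.linalg.solve(X.T @ X + lam * K, X.T)
    return np.trace(S)

def lambda_for_df(X, K, target_df, lo=1e-10, hi=1e10, iters=200):
    # df(lam) is monotonically decreasing in lam, so a bisection on the
    # log scale finds the penalty that attains the target df.
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if df_ridge(X, K, mid) > target_df:
            lo = mid  # fit still too flexible -> increase penalty
        else:
            hi = mid
    return np.sqrt(lo * hi)

rng = np.random.default_rng(1)
n = 200

# Continuous covariate: a one-column (centered) linear base-learner, 1 df.
x_cont = rng.normal(size=(n, 1))
x_cont -= x_cont.mean()

# Categorical covariate with many levels: dummy-coded design matrix.
levels = 10
cat = rng.integers(levels, size=n)
X_cat = np.eye(levels)[cat]

# Unpenalized, the categorical base-learner has about 10 df versus 1 df
# for the linear one, so boosting prefers it even when it carries no
# information. Calibrating a ridge penalty equalizes the df.
target = 1.0
lam = lambda_for_df(X_cat, np.eye(levels), target)
print("df(linear)                 =", round(df_ridge(x_cont, np.eye(1), 0.0), 3))
print("df(categorical, lam = 0)   =", round(df_ridge(X_cat, np.eye(levels), 1e-12), 3))
print("df(categorical, calibrated)=", round(df_ridge(X_cat, np.eye(levels), lam), 3))
```

Because df(lambda) decreases monotonically in lambda, the bisection always finds a calibrating penalty whenever the target df lies between 0 and the unpenalized rank of X; after calibration, the selection step of boosting compares base-learners of equal effective complexity.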
