Exploration of the variability of variable selection based on distances between bootstrap sample results

It is well known that variable selection in multiple regression can be unstable and that the resulting model uncertainty can be considerable. This uncertainty can be quantified and explored by bootstrap resampling (see Sauerbrei et al., Biom J 57:531–555, 2015). Here, approaches are introduced that use the results of bootstrap replications of the variable selection process to obtain more detailed information about the data. The analyses are based on dissimilarities between the results obtained from different bootstrap samples. Dissimilarities are computed between the vectors of predictions and between the sets of selected variables. The dissimilarities are used to map the models by multidimensional scaling, to cluster them, and to construct heatplots. Clusters can point to different interpretations of the data that arise from different selections of variables supported by different bootstrap samples. A new measure of variable selection instability is also defined. The methodology can be applied to various regression models, estimators, and variable selection methods. It is illustrated by three real data examples, using linear regression and a Cox proportional hazards model, with model selection by AIC and BIC.
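The pipeline described above can be sketched in a few lines of Python. This is only an illustrative sketch under several assumptions that are not taken from the paper: greedy forward selection by AIC in ordinary least squares stands in for whatever selection procedure is used, the Jaccard distance is assumed for the sets of selected variables, the Euclidean distance between predictions on the original data is assumed for the prediction vectors, and classical (Torgerson) multidimensional scaling is swapped in for the mapping step. All function names are hypothetical.

```python
import numpy as np

def aic_linear(X, y, cols):
    """Fit OLS with intercept plus columns `cols`; return (AIC up to a constant, coefficients)."""
    n = len(y)
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    return n * np.log(rss / n) + 2 * design.shape[1], beta

def forward_select(X, y):
    """Greedy forward selection (an assumed stand-in): add the variable that lowers AIC most."""
    selected = []
    best_aic, best_beta = aic_linear(X, y, selected)
    while True:
        trials = []
        for j in range(X.shape[1]):
            if j not in selected:
                aic, beta = aic_linear(X, y, selected + [j])
                trials.append((aic, j, beta))
        if not trials:
            break
        aic, j, beta = min(trials, key=lambda t: t[0])
        if aic >= best_aic:  # no candidate improves AIC: stop
            break
        selected.append(j)
        best_aic, best_beta = aic, beta
    return selected, best_beta

def predict(X, cols, beta):
    """Predictions of a selected model on the rows of X."""
    design = np.column_stack([np.ones(len(X))] + [X[:, j] for j in cols])
    return design @ beta

def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b|; defined as 0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)

def bootstrap_dissimilarities(X, y, n_boot=50, seed=0):
    """Run selection on bootstrap samples; return selected sets and both dissimilarity matrices."""
    rng = np.random.default_rng(seed)
    n = len(y)
    sets, preds = [], []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)           # resample rows with replacement
        cols, beta = forward_select(X[idx], y[idx])
        sets.append(frozenset(cols))
        preds.append(predict(X, cols, beta))       # predict on the ORIGINAL data
    d_var = np.array([[jaccard_distance(a, b) for b in sets] for a in sets])
    P = np.array(preds)
    d_pred = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
    return sets, d_var, d_pred

def classical_mds(d, k=2):
    """Classical (Torgerson) MDS of a dissimilarity matrix into k dimensions."""
    m = d.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m            # centering matrix
    B = -0.5 * J @ (d ** 2) @ J                    # double-centered squared distances
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:k]             # k largest eigenvalues
    return vecs[:, order] * np.sqrt(np.clip(vals[order], 0, None))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(80, 5))                   # simulated data, for illustration only
    y = 2 * X[:, 0] - X[:, 1] + rng.normal(size=80)
    sets, d_var, d_pred = bootstrap_dissimilarities(X, y, n_boot=20)
    print(classical_mds(d_pred).shape, len(set(sets)), "distinct models")
```

The two matrices `d_var` and `d_pred` could then be passed to any dissimilarity-based clustering method or drawn as a heatmap; which clustering method and which MDS variant are appropriate is a design choice this sketch does not settle.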

[1]  J. Shao. Bootstrap Model Selection, 1996.

[2]  Anne-Laure Boulesteix, et al. On stability issues in deriving multivariable regression models, 2015, Biometrical Journal (Biometrische Zeitschrift).

[3]  A. Raftery, et al. Bayesian Information Criterion for Censored Survival Models, 2000, Biometrics.

[4]  M. Schumacher, et al. Long- and medium-term ozone effects on lung growth including a broad spectrum of exposure, 2004, European Respiratory Journal.

[5]  Frank E. Harrell, et al. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis, 2001.

[6]  A. Buja, et al. Valid post-selection inference, 2013, arXiv:1306.1059.

[7]  P. Grambsch, et al. Martingale-based residuals for survival models, 1990.

[8]  Fionn Murtagh, et al. Handbook of Cluster Analysis, 2015.

[9]  Alexander M. Mood, et al. Equality of Educational Opportunity, 1967.

[10]  P. Groenen, et al. Applied Multidimensional Scaling, 2012.

[11]  R. Tibshirani. Regression Shrinkage and Selection via the Lasso, 1996.

[12]  Christian Hennig, et al. Design of Dissimilarity Measures: A New Dissimilarity Between Species Distribution Areas, 2006, Data Science and Classification.

[13]  L. García-Escudero, et al. Finding the Number of Groups in Model-Based Clustering via Constrained Likelihoods, 2016.

[14]  D., et al. Regression Models and Life-Tables, 2022.

[15]  Patrick Royston, et al. Multivariable Model-Building: A Pragmatic Approach to Regression Analysis based on Fractional Polynomials for Modelling Continuous Variables, 2008.

[16]  M. Schumacher, et al. A bootstrap resampling procedure for model building: application to the Cox regression model, 1992, Statistics in Medicine.

[17]  Patrick Mair, et al. Multidimensional Scaling Using Majorization: SMACOF in R, 2008.

[18]  P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines, 1901.

[19]  Luis Angel García-Escudero, et al. Finding the Number of Normal Groups in Model-Based Clustering via Constrained Likelihoods, 2018.

[20]  Willi Sauerbrei, et al. On properties of predictors derived with a two-step bootstrap model averaging approach - A simulation study in the linear regression model, 2008, Comput. Stat. Data Anal.

[21]  J. Harley, et al. A step-up procedure for selecting variables associated with survival, 1975, Biometrics.

[22]  Christian Hennig, et al. Clustering strategy and method selection, 2015, arXiv:1503.02059.

[23]  Peter J. Rousseeuw, et al. Robust Regression and Outlier Detection, 2005, Wiley Series in Probability and Statistics.

[24]  L. Breiman. The Little Bootstrap and other Methods for Dimensionality Selection in Regression: X-Fixed Prediction Error, 1992.

[25]  R. Sokal, et al. The Comparison of Dendrograms by Objective Methods, 1962.

[26]  Anthony C. Atkinson, et al. Robust model selection with flexible trimming, 2010, Comput. Stat. Data Anal.

[27]  B. Efron. Estimation and Accuracy After Model Selection, 2014, Journal of the American Statistical Association.

[28]  Alan Welsh, et al. mplot: An R Package for Graphical Model Stability and Variable Selection Procedures, 2015, arXiv:1509.07583.