Resampling methods for variable selection in robust regression

With the inundation of large data sets requiring analysis and empirical model building, outliers have become commonplace. Fortunately, several standard statistical software packages now let practitioners fit outlier-contaminated data sets easily with robust regression estimators. However, little guidance is available on selecting the best subset of predictor variables when using these robust estimators. We initially consider cross-validation and bootstrap resampling methods that have performed well for least-squares variable selection. It turns out that these variable selection methods cannot be applied directly to contaminated data sets under a robust estimation scheme: the prediction errors, inflated by the outliers, are not reliable measures of how well the robust model fits the data. As a result, new resampling variable selection methods are proposed that introduce alternative estimates of prediction error for the contaminated model. We demonstrate that, although robust estimation and resampling variable selection are computationally complex procedures, the two techniques can be combined for superior results using modest computational resources. Monte Carlo simulation is used to evaluate the proposed variable selection procedures against alternatives through a designed experiment approach. The experiment factors include the percentage of outliers, outlier geometry, bootstrap sample size, number of bootstrap samples, and cross-validation assessment size. The results are summarized and recommendations for use are provided.
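The core idea, pairing a robust fit with an outlier-resistant estimate of out-of-sample prediction error, can be sketched in code. The following is a minimal illustration, not the paper's exact procedure: `huber_irls`, `trimmed_pe`, and `bootstrap_select` are hypothetical names, the Huber M-estimator stands in for whichever robust estimator is used, and a trimmed mean of squared out-of-bag errors stands in for the paper's alternative prediction-error estimates.

```python
import itertools
import numpy as np

def huber_irls(X, y, delta=1.345, n_iter=50):
    """Fit a Huber M-estimator (with intercept) by iteratively
    reweighted least squares; scale is re-estimated by the MAD."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta = np.linalg.lstsq(Xd, y, rcond=None)[0]
    for _ in range(n_iter):
        r = y - Xd @ beta
        s = np.median(np.abs(r - np.median(r))) / 0.6745 + 1e-12
        # Huber weights: 1 for small residuals, delta*s/|r| for large ones.
        w = np.minimum(1.0, delta * s / np.maximum(np.abs(r), 1e-12))
        beta = np.linalg.lstsq(Xd * w[:, None], y * w, rcond=None)[0]
    return beta

def trimmed_pe(y_true, y_pred, trim=0.1):
    """Trimmed mean of squared prediction errors: the largest errors
    are discarded so outliers do not dominate the assessment."""
    e = np.sort((y_true - y_pred) ** 2)
    k = int(len(e) * (1 - trim))
    return e[: max(k, 1)].mean()

def bootstrap_select(X, y, n_boot=50, trim=0.1, rng=None):
    """Score every predictor subset by its average out-of-bag trimmed
    prediction error over bootstrap resamples; return the best subset."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    scores = {}
    for subset in itertools.chain.from_iterable(
            itertools.combinations(range(p), k) for k in range(1, p + 1)):
        cols = list(subset)
        pes = []
        for _ in range(n_boot):
            idx = rng.integers(0, n, n)              # bootstrap sample
            oob = np.setdiff1d(np.arange(n), idx)    # out-of-bag cases
            if len(oob) == 0:
                continue
            beta = huber_irls(X[idx][:, cols], y[idx])
            pred = np.column_stack(
                [np.ones(len(oob)), X[oob][:, cols]]) @ beta
            pes.append(trimmed_pe(y[oob], pred, trim))
        scores[subset] = float(np.mean(pes))
    best = min(scores, key=scores.get)
    return best, scores
```

Because the assessment error is both computed on out-of-bag cases and trimmed, a subset is not penalized merely because a few contaminated observations are impossible to predict, which is the failure mode the abstract describes for naive resampling selection.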
