Consistent Group Identification and Variable Selection in Regression With Correlated Predictors

Statistical procedures for variable selection have become an integral part of data analysis. Successful procedures combine high predictive accuracy with interpretable models and computational efficiency. Penalized methods that shrink coefficients have proved successful in many settings, but models with correlated predictors remain particularly challenging. We propose a penalization procedure that performs variable selection while automatically clustering predictors into groups. We also study the oracle properties of this procedure, including consistency in group identification. The proposed method compares favorably with existing selection approaches in both prediction accuracy and model discovery, while remaining computationally efficient. Supplementary materials are available online.
