The Cluster Elastic Net for High-Dimensional Regression With Unknown Variable Grouping

In high-dimensional regression, the elastic net produces a parsimonious model by shrinking all coefficients toward the origin. However, this behavior is not always desirable: if some features are highly correlated with each other and associated with the response, then we might wish to perform less shrinkage on the coefficients corresponding to that subset of features. We propose the cluster elastic net, which selectively shrinks the coefficients for such variables toward each other, rather than toward the origin. Instead of assuming that the clusters are known a priori, the cluster elastic net infers clusters of features from the data, on the basis of correlation among the variables as well as association with the response. These clusters are then used to perform regression more accurately. We demonstrate the theoretical advantages of our proposed approach and explore its performance in a simulation study and in an application to HIV drug resistance data. Supplementary materials are available online.
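
To make the contrast between shrinking toward the origin and shrinking toward each other concrete, the following is a minimal toy sketch, not the criterion proposed in the paper. It assumes exactly two features that form a single, already-known cluster, and it swaps in a simple quadratic penalty on the difference of the two coefficients as a stand-in for the cluster penalty; the function name `fit_two_features` and the data are hypothetical. The cluster-inference step and any sparsity-inducing penalty are omitted.

```python
def fit_two_features(x1, x2, y, lam, penalty):
    """Closed-form fit of y ~ b1*x1 + b2*x2 under a quadratic penalty.

    Illustrative assumption, not the paper's criterion:
      penalty="ridge":  (lam / 2) * (b1**2 + b2**2)   -> pulls toward the origin
      penalty="fusion": (lam / 2) * (b1 - b2)**2      -> pulls b1 and b2 together
    """
    # Entries of the Gram matrix X'X and of X'y.
    a = sum(u * u for u in x1)
    b = sum(u * v for u, v in zip(x1, x2))
    d = sum(v * v for v in x2)
    c1 = sum(u * t for u, t in zip(x1, y))
    c2 = sum(v * t for v, t in zip(x2, y))

    # Add lam times the penalty matrix to X'X.
    if penalty == "ridge":      # penalty matrix: identity
        m11, m12, m22 = a + lam, b, d + lam
    elif penalty == "fusion":   # penalty matrix: [[1, -1], [-1, 1]]
        m11, m12, m22 = a + lam, b - lam, d + lam
    else:
        raise ValueError("penalty must be 'ridge' or 'fusion'")

    # Solve the 2x2 normal equations with Cramer's rule.
    det = m11 * m22 - m12 * m12
    beta1 = (m22 * c1 - m12 * c2) / det
    beta2 = (m11 * c2 - m12 * c1) / det
    return beta1, beta2


# Hypothetical data: two highly correlated features, both tied to the response.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.0, 2.0, 3.0, 5.0]
y = [3.0 * u + 1.0 * v for u, v in zip(x1, x2)]

for lam in (0.0, 1.0, 100.0):
    print(lam,
          fit_two_features(x1, x2, y, lam, "ridge"),
          fit_two_features(x1, x2, y, lam, "fusion"))
```

In this toy setting, increasing `lam` under the ridge-style penalty drives both coefficients toward zero, whereas under the fusion-style penalty the two coefficients approach a common nonzero value, so the combined signal of the correlated pair is retained.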
