Grouped and Hierarchical Model Selection through Composite Absolute Penalties

Extracting useful information from high-dimensional data is an important focus of today's statistical research and practice. Penalized loss function minimization has been shown to be effective for this task both theoretically and empirically. With the virtues of both regularization and sparsity, the L1-penalized L2-loss minimization method, the Lasso, has been popular in regression models. In this paper, we combine different norms, including L1, to form an intelligent penalty that adds side information to the fitting of a regression or classification model in order to obtain reasonable estimates. Specifically, we introduce the Composite Absolute Penalties (CAP) family, which allows grouping and hierarchical relationships between the predictors to be expressed. CAP penalties are built by defining groups and combining the properties of norm penalties at the across-group and within-group levels. Grouped selection occurs for non-overlapping groups. In that case, we give a Bayesian
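To make the group-level construction concrete, the following minimal sketch (in Python, not from the paper) computes a CAP-style penalty of the form T(beta) = sum_k ||beta_{G_k}||_{gamma_k}^{gamma_0}, where gamma_0 governs behavior across groups and gamma_k governs behavior within each group. The function name, group definitions, and parameter values below are illustrative assumptions; with gamma_0 = 1 and gamma_k = 2 the expression reduces to a group-Lasso-type penalty, and singleton groups with gamma_k = 1 recover the Lasso penalty.

```python
import numpy as np

def cap_penalty(beta, groups, gamma0=1.0, gamma_within=2.0):
    """Sketch of a CAP-style penalty: sum over groups of the
    within-group L_{gamma_within} norm raised to the power gamma0.

    beta         : 1-D array of regression coefficients
    groups       : list of index arrays defining (possibly overlapping) groups
    gamma0       : across-group norm parameter (gamma0 = 1 encourages group sparsity)
    gamma_within : within-group norm parameter (e.g. 2 or np.inf)
    """
    total = 0.0
    for g in groups:
        group_norm = np.linalg.norm(beta[g], ord=gamma_within)
        total += group_norm ** gamma0
    return total

# Illustrative example: 6 coefficients split into two non-overlapping groups of 3.
beta = np.array([1.5, -0.5, 0.0, 0.0, 0.0, 0.0])
groups = [np.array([0, 1, 2]), np.array([3, 4, 5])]

# gamma0 = 1, gamma_within = 2 gives a group-Lasso-type penalty value.
print(cap_penalty(beta, groups, gamma0=1.0, gamma_within=2.0))
```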
