Convex vs non-convex estimators for regression and sparse estimation: the mean squared error properties of ARD and GLasso

We study a simple linear regression problem with grouped variables; we are interested in methods that jointly perform estimation and variable selection, i.e., that automatically set entire groups of variables in the regression vector to zero. The Group Lasso (GLasso), a well-known approach to this problem and a special case of Multiple Kernel Learning (MKL), boils down to solving convex optimization problems. In contrast, the Bayesian approach commonly known as Sparse Bayesian Learning (SBL), one version of which is the well-known Automatic Relevance Determination (ARD), leads to nonconvex problems. In this paper we discuss the relation between ARD (and a penalized version, which we call PARD) and GLasso, and study their asymptotic properties in terms of the mean squared error in estimating the unknown parameter. The theoretical arguments developed here are independent of the correctness of the prior models and clarify the advantages of PARD over GLasso.
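To make the convexity contrast concrete, the sketch below (not from the paper; names are illustrative) shows the group soft-thresholding proximal operator that underlies standard convex GLasso solvers: the mixed L2/L1 penalty shrinks each group of coefficients as a unit and sets the whole group exactly to zero when its norm falls below the regularization level.

```python
import numpy as np

def group_soft_threshold(beta, lam):
    """Proximal operator of the group-lasso penalty lam * ||beta||_2
    for a single group of coefficients: the group is either zeroed
    out entirely or uniformly shrunk toward the origin."""
    norm = np.linalg.norm(beta)
    if norm <= lam:
        return np.zeros_like(beta)  # whole group selected out
    return (1.0 - lam / norm) * beta  # group kept, norm reduced by lam

# A group with small norm is eliminated entirely...
print(group_soft_threshold(np.array([0.1, -0.1]), lam=0.5))
# ...while a group with large norm is only shrunk.
print(group_soft_threshold(np.array([3.0, 4.0]), lam=0.5))
```

Because this operator is the proximal map of a convex penalty, block-coordinate or proximal-gradient GLasso iterations converge to a global minimizer; ARD/PARD instead optimize a nonconvex marginal-likelihood objective, which is the trade-off the paper analyzes in mean-squared-error terms.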
