Simultaneous Support Recovery in High Dimensions: Benefits and Perils of Block $\ell_{1}/\ell_{\infty}$-Regularization

Given a collection of $r \ge 2$ linear regression problems in $p$ dimensions, suppose that the regression coefficients share partially common supports of size at most $s$. This set-up suggests the use of $\ell_1/\ell_\infty$-regularized regression for joint estimation of the $p \times r$ matrix of regression coefficients. We analyze the high-dimensional scaling of $\ell_1/\ell_\infty$-regularized quadratic programming, considering both consistency rates in $\ell_\infty$-norm and how the minimal sample size $n$ required for consistent variable selection scales with the model dimension, sparsity, and overlap between the supports. We first establish bounds on the $\ell_\infty$-error and sufficient conditions for exact variable selection, both for fixed design matrices and for designs drawn randomly from general Gaussian distributions. Specializing to the case of $r = 2$ linear regression problems with standard Gaussian designs whose supports overlap in a fraction $\alpha \in [0,1]$ of their entries, we prove that the $\ell_1/\ell_\infty$-regularized method undergoes a phase transition characterized by the rescaled sample size $\theta_{1,\infty}(n, p, s, \alpha) = n/\bigl[(4 - 3\alpha)\, s \log\bigl(p - (2 - \alpha)s\bigr)\bigr]$. An implication is that $\ell_1/\ell_\infty$-regularization yields improved statistical efficiency when the overlap parameter is large enough ($\alpha > 2/3$), but worse statistical efficiency than a naive Lasso-based approach for moderate to small overlap ($\alpha < 2/3$). Empirical simulations show close agreement between the theory and actual behavior in practice. These results indicate that caution must be exercised in applying $\ell_1/\ell_\infty$ block regularization: if the data do not closely match the assumed block structure, it can impair statistical performance relative to computationally less expensive schemes.
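For concreteness, the joint estimator analyzed here can be written as the following $\ell_1/\ell_\infty$-regularized quadratic program (a sketch in standard notation; the observation vectors $y^{(k)}$, design matrices $X^{(k)}$, and regularization parameter $\lambda_n$ are not spelled out in the abstract and follow common convention for this setting):
\[
\widehat{B} \;\in\; \arg\min_{B \in \mathbb{R}^{p \times r}} \;
\frac{1}{2n} \sum_{k=1}^{r} \bigl\| y^{(k)} - X^{(k)} B_{\cdot k} \bigr\|_2^2
\;+\; \lambda_n \sum_{j=1}^{p} \max_{1 \le k \le r} \bigl| B_{jk} \bigr|.
\]
The penalty sums the $\ell_\infty$-norm of each row of $B$, so a row is either zeroed out entirely or retained across all $r$ problems, which is what couples the individual supports.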
