Simultaneous Support Recovery in High Dimensions: Benefits and Perils of Block $\ell_{1}/\ell_{\infty}$-Regularization

Given a collection of $r \ge 2$ linear regression problems in $p$ dimensions, suppose that the regression coefficients share partially common supports of size at most $s$. This set-up suggests the use of $\ell_1/\ell_\infty$-regularized regression for joint estimation of the $p \times r$ matrix of regression coefficients. We analyze the high-dimensional scaling of $\ell_1/\ell_\infty$-regularized quadratic programming, considering both consistency rates in $\ell_\infty$-norm and how the minimal sample size $n$ required for consistent variable selection scales with the model dimension, sparsity, and overlap between the supports. We first establish bounds on the $\ell_\infty$-error and sufficient conditions for exact variable selection, both for fixed design matrices and for designs drawn randomly from general Gaussian distributions. Specializing to the case of $r = 2$ linear regression problems with standard Gaussian designs whose supports overlap in a fraction $\alpha \in [0,1]$ of their entries, we prove that the $\ell_1/\ell_\infty$-regularized method undergoes a phase transition characterized by the rescaled sample size $\theta_{1,\infty}(n, p, s, \alpha) = n/\bigl[(4 - 3\alpha)\, s \log\bigl(p - (2 - \alpha)s\bigr)\bigr]$. An implication is that $\ell_1/\ell_\infty$-regularization yields improved statistical efficiency when the overlap parameter is large enough ($\alpha > 2/3$), but worse statistical efficiency than a naive Lasso-based approach for moderate to small overlap ($\alpha < 2/3$). Empirical simulations show close agreement between the theory and actual behavior in practice. These results indicate that caution must be exercised in applying $\ell_1/\ell_\infty$ block regularization: if the data do not closely match the assumed block structure, it can impair statistical performance relative to computationally less expensive schemes.
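For concreteness, the joint estimator analyzed here can be written as the following $\ell_1/\ell_\infty$-regularized quadratic program (a sketch in standard notation; the observation vectors $y^{(k)}$, design matrices $X^{(k)}$, and regularization parameter $\lambda_n$ are not spelled out in the abstract and follow common convention for this setting):
\[
\widehat{B} \;\in\; \arg\min_{B \in \mathbb{R}^{p \times r}} \;
\frac{1}{2n} \sum_{k=1}^{r} \bigl\| y^{(k)} - X^{(k)} B_{\cdot k} \bigr\|_2^2
\;+\; \lambda_n \sum_{j=1}^{p} \max_{1 \le k \le r} \bigl| B_{jk} \bigr|.
\]
The penalty sums the $\ell_\infty$-norm of each row of $B$, so a row is either zeroed out entirely or retained across all $r$ problems, which is what couples the individual supports.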
