On the Power of Preconditioning in Sparse Linear Regression

Sparse linear regression is a fundamental problem in high-dimensional statistics, but strikingly little is known about how to efficiently solve it without restrictive conditions on the design matrix. We consider the (correlated) random design setting, where the covariates are independently drawn from a multivariate Gaussian N(0, Σ), for some n×n positive semi-definite matrix Σ, and seek estimators ŵ minimizing (ŵ − w∗)ᵀ Σ (ŵ − w∗), where w∗ is the k-sparse ground truth. Information-theoretically, one can achieve strong error bounds with only O(k log n) samples for arbitrary Σ and w∗; however, no efficient algorithms are known to match these guarantees even with o(n) samples, without further assumptions on Σ or w∗. Yet there is little evidence for this gap in the random design setting: computational lower bounds are only known for worst-case design matrices. To date, random-design instances (i.e. specific covariance matrices Σ) have only been proven hard against the Lasso program and variants; moreover, these “hard” instances can often be solved by Lasso after a simple change of basis (i.e. preconditioning). In this work, we give both upper and lower bounds clarifying the power of preconditioning as a tool for solving sparse linear regression problems. On the one hand, we show that the preconditioned Lasso can solve a large class of sparse linear regression problems nearly optimally: it succeeds whenever the dependency structure of the covariates, in the sense of the Markov property, has low treewidth — even if Σ is highly ill-conditioned. This upper bound builds on ideas from the wavelet and signal processing literature. As a special case of this result, we give an algorithm for sparse linear regression with covariates from an autoregressive time series model, where we also show that the (usual) Lasso provably fails. On the other hand, we construct (for the first time) random-design instances which are provably hard even for an optimally preconditioned Lasso. In fact, we complete our treewidth classification by proving that for any treewidth-t graph, there exists a Gaussian Markov Random Field on this graph such that the preconditioned Lasso, with any choice of preconditioner, requires Ω(t) samples to recover O(log n)-sparse signals when covariates are drawn from this model.

kelner@mit.edu. This work was supported in part by NSF Large CCF-1565235, NSF Medium CCF-1955217, and NSF TRIPODS 1740751.
fkoehler@mit.edu. This work was supported in part by NSF CAREER Award CCF-1453261, NSF Large CCF-1565235, A. Moitra’s ONR Young Investigator Award, and E. Mossel’s Vannevar Bush Faculty Fellowship ONR N00014-20-1-2826.
raghum@cs.ucla.edu. This work was supported in part by NSF CAREER Award CCF-1553605 and NSF Small CCF-2007682.
drohatgi@mit.edu. This work was supported in part by NSF Large CCF-1565235, NSF Medium CCF-1955217, and the MIT UROP Office.

arXiv:2106.09207v1 [cs.LG] 17 Jun 2021
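To make the change-of-basis idea concrete, the following is a minimal sketch in Python of "Lasso after preconditioning" on a random-walk (AR-like) design. It is not the paper's algorithm or construction: the differencing-plus-Haar preconditioner, the problem sizes, and the regularization scale are illustrative assumptions, and scikit-learn's Lasso stands in for any Lasso solver.

```python
# Sketch only: Lasso after a change of basis w = M v on a random-walk design.
# The preconditioner M (differencing followed by a Haar transform) is an
# illustrative assumption, not the paper's construction. Requires numpy, scikit-learn.
import numpy as np
from sklearn.linear_model import Lasso


def haar_analysis(d):
    """Orthonormal Haar analysis matrix A (rows are Haar vectors); d must be a power of 2."""
    A = np.array([[1.0]])
    while A.shape[0] < d:
        m = A.shape[0]
        A = np.vstack([np.kron(A, [1.0, 1.0]),
                       np.kron(np.eye(m), [1.0, -1.0])]) / np.sqrt(2.0)
    return A


def preconditioned_lasso(X, y, M, lam):
    """Run the Lasso in the basis given by M (regress y on X @ M), then map back."""
    fit = Lasso(alpha=lam, fit_intercept=False, max_iter=100_000).fit(X @ M, y)
    return M @ fit.coef_


rng = np.random.default_rng(0)
n, d, k, sigma = 300, 256, 5, 0.1

# Random-walk covariates: each row of X is a cumulative sum of i.i.d. Gaussians,
# so X = G @ U with U upper-triangular ones and Sigma = U.T @ U badly ill-conditioned.
G = rng.standard_normal((n, d))
U = np.triu(np.ones((d, d)))
X = G @ U
Sigma = U.T @ U

w_star = np.zeros(d)
w_star[rng.choice(d, size=k, replace=False)] = rng.standard_normal(k)
y = X @ w_star + sigma * rng.standard_normal(n)

# Preconditioner M = U^{-1} @ A.T: differencing whitens the covariates (X @ U^{-1} = G),
# and the transformed coefficients A @ U @ w_star are the Haar coefficients of the
# suffix sums of w_star, a piecewise-constant vector with k jumps, hence only
# O(k log d) of them are nonzero.
A = haar_analysis(d)
Dinv = np.eye(d) - np.diag(np.ones(d - 1), 1)   # U^{-1}: first-difference operator
M = Dinv @ A.T

lam = sigma * np.sqrt(2 * np.log(d) / n)        # heuristic regularization scale
w_pre = preconditioned_lasso(X, y, M, lam)
w_plain = Lasso(alpha=lam, fit_intercept=False, max_iter=100_000).fit(X, y).coef_


def pred_err(w):
    # The error metric from the abstract: (w - w*)^T Sigma (w - w*).
    diff = w - w_star
    return float(diff @ Sigma @ diff)


print("preconditioned Lasso:", pred_err(w_pre))
print("plain Lasso:         ", pred_err(w_plain))
```

The same regularization scale is reused for the plain Lasso purely for comparison; it is a heuristic choice, not a tuned or theoretically justified one.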
