Optimal Sketching Bounds for Sparse Linear Regression

We study oblivious sketching for $k$-sparse linear regression under various loss functions, such as an $\ell_p$ norm or a broad class of hinge-like loss functions that includes the logistic and ReLU losses. We show that for sparse $\ell_2$ norm regression, there is a distribution over oblivious sketches with $\Theta(k\log(d/k)/\varepsilon^2)$ rows, which is tight up to a constant factor. This extends to $\ell_p$ loss with an additional additive $O(k\log(k/\varepsilon)/\varepsilon^2)$ term in the upper bound. This establishes a surprising separation from the related sparse recovery problem, which is an important special case of sparse regression. For this problem, under the $\ell_2$ norm, we observe an upper bound of $O(k\log(d)/\varepsilon + k\log(k/\varepsilon)/\varepsilon^2)$ rows, showing that sparse recovery is strictly easier to sketch than sparse regression. For sparse regression under hinge-like loss functions, including sparse logistic and sparse ReLU regression, we give the first known sketching bounds that achieve $o(d)$ rows, showing that $O(\mu^2 k\log(\mu n d/\varepsilon)/\varepsilon^2)$ rows suffice, where $\mu$ is a natural complexity parameter needed to obtain relative error bounds for these loss functions. We again show that this dimension is tight, up to lower-order terms and the dependence on $\mu$. Finally, we show that similar sketching bounds can be achieved for LASSO regression, a popular convex relaxation of sparse regression, in which one aims to minimize $\|Ax-b\|_2^2+\lambda\|x\|_1$ over $x\in\mathbb{R}^d$. We show that a sketching dimension of $O(\log(d)/(\lambda\varepsilon)^2)$ suffices and that the dependence on $d$ and $\lambda$ is tight.
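To make the setting concrete, below is a minimal numerical sketch for the $\ell_2$ case, assuming a plain Gaussian sketching matrix and a brute-force $k$-sparse solver (feasible only for tiny $d$); the synthetic instance, solver, and parameter choices are illustrative assumptions and not the constructions analyzed here. The oblivious sketch $S$ is drawn without looking at $(A, b)$, and the sparse regression problem is then solved on the much smaller pair $(SA, Sb)$.

```python
# Illustrative only: oblivious sketching for k-sparse l2 regression with a
# Gaussian sketch and a brute-force sparse solver (not the paper's construction).
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, d, k, eps = 500, 20, 2, 0.5

# Synthetic k-sparse regression instance.
A = rng.standard_normal((n, d))
x_true = np.zeros(d)
x_true[:k] = [3.0, -2.0]
b = A @ x_true + 0.1 * rng.standard_normal(n)

# Oblivious sketch: S is drawn independently of (A, b), with
# m on the order of k log(d/k) / eps^2 rows, matching the l2 upper bound.
m = int(np.ceil(k * np.log(d / k) / eps**2))
S = rng.standard_normal((m, n)) / np.sqrt(m)
SA, Sb = S @ A, S @ b

def best_k_sparse(M, y, k):
    """Brute-force k-sparse least squares: try every support of size k."""
    best_cost, best_x = np.inf, None
    for supp in itertools.combinations(range(M.shape[1]), k):
        cols = list(supp)
        coef, *_ = np.linalg.lstsq(M[:, cols], y, rcond=None)
        x = np.zeros(M.shape[1])
        x[cols] = coef
        cost = np.linalg.norm(M @ x - y)
        if cost < best_cost:
            best_cost, best_x = cost, x
    return best_x

x_sk = best_k_sparse(SA, Sb, k)   # minimize ||SAx - Sb||_2 over k-sparse x
x_ex = best_k_sparse(A, b, k)     # exact k-sparse minimizer of ||Ax - b||_2
print("sketched solution cost:", np.linalg.norm(A @ x_sk - b))
print("optimal cost:          ", np.linalg.norm(A @ x_ex - b))
```

The final comparison is informal: the intended guarantee for a sketch with $\Theta(k\log(d/k)/\varepsilon^2)$ rows is that the minimizer of the sketched problem has cost on the original data within a $(1+\varepsilon)$ factor of the true $k$-sparse optimum.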
