Sparse sketches with small inversion bias

For a tall $n\times d$ matrix $A$ and a random $m\times n$ sketching matrix $S$, the sketched estimate of the inverse covariance matrix $(A^\top A)^{-1}$ is typically biased: $E[(\tilde A^\top\tilde A)^{-1}]\ne(A^\top A)^{-1}$, where $\tilde A=SA$. This phenomenon, which we call inversion bias, arises, for example, in statistics and distributed optimization when averaging multiple independently constructed estimates of quantities that depend on the inverse covariance. We develop a framework for analyzing inversion bias, based on our proposed notion of an $(\epsilon,\delta)$-unbiased estimator for random matrices. We show that when the sketching matrix $S$ is dense with i.i.d. sub-gaussian entries, then after a simple rescaling, the estimator $(\frac m{m-d}\tilde A^\top\tilde A)^{-1}$ is $(\epsilon,\delta)$-unbiased for $(A^\top A)^{-1}$ with a sketch of size $m=O(d+\sqrt d/\epsilon)$. This implies that for $m=O(d)$, the inversion bias of this estimator is $O(1/\sqrt d)$, which is much smaller than the $\Theta(1)$ approximation error implied by the subspace embedding guarantee for sub-gaussian sketches. We then propose a new sketching technique, called LEverage Score Sparsified (LESS) embeddings, which combines ideas from data-oblivious sparse embeddings and data-aware leverage-based row sampling to achieve inversion bias $\epsilon$ with sketch size $m=O(d\log d+\sqrt d/\epsilon)$ in time $O(\text{nnz}(A)\log n+md^2)$, where $\text{nnz}(A)$ is the number of non-zero entries of $A$. The key techniques enabling our analysis include an extension of a classical inequality of Bai and Silverstein for random quadratic forms, which we call the Restricted Bai-Silverstein inequality, and anti-concentration of the binomial distribution via the Paley-Zygmund inequality, which we use to prove a lower bound showing that leverage score sampling sketches generally do not achieve small inversion bias.
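To make the rescaling concrete, the following minimal NumPy sketch (an illustration under assumed parameters, not the paper's implementation) compares averages of the plain estimator $(\tilde A^\top\tilde A)^{-1}$ with the debiased estimator $(\frac m{m-d}\tilde A^\top\tilde A)^{-1}$ under a dense Gaussian sketch; the helper name `sketched_inverse_covariance` and all numerical settings are hypothetical, and LESS embeddings are not implemented here.

```python
# Illustrative sketch only: dense Gaussian sketching with the m/(m-d)
# debiasing rescaling described in the abstract. LESS embeddings and the
# (epsilon, delta)-unbiasedness analysis are not reproduced here.
import numpy as np

def sketched_inverse_covariance(A, m, rng, debias=True):
    """Estimate (A^T A)^{-1} from a single Gaussian sketch SA of size m x d."""
    n, d = A.shape
    # Dense sub-gaussian (Gaussian) sketch, scaled so that E[S^T S] = I_n.
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    As = S @ A
    scale = m / (m - d) if debias else 1.0
    return np.linalg.inv(scale * (As.T @ As))

rng = np.random.default_rng(0)
n, d, m, q = 2000, 20, 100, 400   # q independent sketches to average over
A = rng.standard_normal((n, d))
true_inv = np.linalg.inv(A.T @ A)

for debias in (False, True):
    avg = sum(sketched_inverse_covariance(A, m, rng, debias) for _ in range(q)) / q
    err = np.linalg.norm(avg - true_inv) / np.linalg.norm(true_inv)
    print(f"debias={debias}: relative bias of averaged estimate ~ {err:.3f}")
```

Averaging many independent sketches drives the variance down but leaves the bias untouched, so the unrescaled average stays a constant factor away from $(A^\top A)^{-1}$, while the $\frac m{m-d}$ rescaling brings the averaged estimate much closer to it.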
