Sparse sketches with small inversion bias

For a tall $n\times d$ matrix $A$ and a random $m\times n$ sketching matrix $S$, the sketched estimate of the inverse covariance matrix $(A^\top A)^{-1}$ is typically biased: $E[(\tilde A^\top\tilde A)^{-1}]\ne(A^\top A)^{-1}$, where $\tilde A=SA$. This phenomenon, which we call inversion bias, arises, for example, in statistics and distributed optimization when averaging multiple independently constructed estimates of quantities that depend on the inverse covariance. We develop a framework for analyzing inversion bias, based on our proposed concept of an $(\epsilon,\delta)$-unbiased estimator for random matrices.

We show that when the sketching matrix $S$ is dense and has i.i.d. sub-gaussian entries, then after a simple rescaling the estimator $(\frac m{m-d}\tilde A^\top\tilde A)^{-1}$ is $(\epsilon,\delta)$-unbiased for $(A^\top A)^{-1}$ with a sketch of size $m=O(d+\sqrt d/\epsilon)$. This implies that for $m=O(d)$, the inversion bias of this estimator is $O(1/\sqrt d)$, which is much smaller than the $\Theta(1)$ approximation error obtained as a consequence of the subspace embedding guarantee for sub-gaussian sketches.

We then propose a new sketching technique, called LEverage Score Sparsified (LESS) embeddings, which combines ideas from data-oblivious sparse embeddings and data-aware leverage-based row sampling to achieve inversion bias $\epsilon$ with a sketch of size $m=O(d\log d+\sqrt d/\epsilon)$, constructed in time $O(\text{nnz}(A)\log n+md^2)$, where $\text{nnz}(A)$ denotes the number of non-zero entries of $A$.

The key techniques enabling our analysis include an extension of a classical inequality of Bai and Silverstein for random quadratic forms, which we call the Restricted Bai-Silverstein inequality, and anti-concentration of the binomial distribution via the Paley-Zygmund inequality, which we use to prove a lower bound showing that leverage score sampling sketches generally do not achieve small inversion bias.
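
To make the rescaled estimator concrete, the following Monte Carlo check (an illustrative sketch, not the paper's experiments; dimensions and trial counts are arbitrary choices) compares the naive estimator $(\tilde A^\top\tilde A)^{-1}$ and the rescaled estimator $(\frac m{m-d}\tilde A^\top\tilde A)^{-1}$ against the exact $(A^\top A)^{-1}$ for a dense Gaussian sketch.

```python
# Illustrative Monte Carlo check of inversion bias for a dense Gaussian sketch.
# Compares E[(A~^T A~)^{-1}] and E[(m/(m-d) A~^T A~)^{-1}] against (A^T A)^{-1}.
import numpy as np

rng = np.random.default_rng(0)
n, d, m, trials = 2000, 20, 200, 500

A = rng.standard_normal((n, d))
exact = np.linalg.inv(A.T @ A)

naive_mean = np.zeros((d, d))
rescaled_mean = np.zeros((d, d))
for _ in range(trials):
    S = rng.standard_normal((m, n)) / np.sqrt(m)       # dense Gaussian sketch, E[S^T S] = I
    A_sk = S @ A                                       # sketched matrix  A~ = S A
    inv_cov = np.linalg.inv(A_sk.T @ A_sk)
    naive_mean += inv_cov / trials                     # averages (A~^T A~)^{-1}
    rescaled_mean += (m - d) / m * inv_cov / trials    # averages (m/(m-d) A~^T A~)^{-1}

def rel_bias(M):
    """Relative spectral-norm distance from the exact inverse covariance."""
    return np.linalg.norm(M - exact, 2) / np.linalg.norm(exact, 2)

print("relative bias, naive estimator:   ", rel_bias(naive_mean))
print("relative bias, rescaled estimator:", rel_bias(rescaled_mean))
```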

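The LESS construction can be illustrated as follows. The code below is a simplified instantiation for intuition only: it assumes exact leverage scores (the method as described uses fast approximate leverage scores to reach the $O(\text{nnz}(A)\log n+md^2)$ running time) and places $s\approx d$ non-zeros in each sketch row by i.i.d. sampling from the leverage score distribution; the function names `leverage_scores` and `less_embed` are hypothetical.

```python
# A simplified sketch of a LEverage Score Sparsified (LESS) embedding, assuming
# exact leverage scores and s non-zeros per sketch row sampled i.i.d. from the
# leverage score distribution. Illustrative only, not reference code.
import numpy as np

def leverage_scores(A):
    """Exact leverage scores l_i = ||Q_i||^2 via a thin QR factorization."""
    Q, _ = np.linalg.qr(A)
    return np.sum(Q * Q, axis=1)

def less_embed(A, m, s, rng):
    """Return the sketch S A for an m x n LESS-style embedding with s non-zeros per row."""
    n, d = A.shape
    p = leverage_scores(A)
    p = p / p.sum()                                    # leverage score sampling distribution
    SA = np.zeros((m, d))
    for t in range(m):
        idx = rng.choice(n, size=s, p=p)               # s sampled coordinates (rows of A)
        signs = rng.choice([-1.0, 1.0], size=s)        # Rademacher signs
        weights = signs / np.sqrt(m * s * p[idx])      # scaling so that E[S^T S] = I
        SA[t] = weights @ A[idx]                       # sparse sketch row applied to A
    return SA

rng = np.random.default_rng(1)
n, d, m = 5000, 30, 400
# Rows with widely varying norms, so the leverage scores are non-uniform.
A = rng.standard_normal((n, d)) * np.exp(rng.standard_normal(n))[:, None]

SA = less_embed(A, m, s=d, rng=rng)
exact = np.linalg.inv(A.T @ A)
rescaled = (m - d) / m * np.linalg.inv(SA.T @ SA)      # (m/(m-d) A~^T A~)^{-1}
print("relative error of rescaled estimate:",
      np.linalg.norm(rescaled - exact, 2) / np.linalg.norm(exact, 2))
```

With this scaling, each sketch row $x_t$ satisfies $E[x_t x_t^\top]=\frac1m I$, so $E[S^\top S]=I$, while each row touches only about $d$ of the $n$ coordinates.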