Sparse sketches with small inversion bias

For a tall $n\times d$ matrix $A$ and a random $m\times n$ sketching matrix $S$, the sketched estimate of the inverse covariance matrix $(A^\top A)^{-1}$ is typically biased: $E[(\tilde A^\top\tilde A)^{-1}]\ne(A^\top A)^{-1}$, where $\tilde A=SA$. This phenomenon, which we call inversion bias, arises, for example, in statistics and distributed optimization when averaging multiple independently constructed estimates of quantities that depend on the inverse covariance. We develop a framework for analyzing inversion bias, based on our proposed notion of an $(\epsilon,\delta)$-unbiased estimator for random matrices. We show that when the sketching matrix $S$ is dense with i.i.d. sub-gaussian entries, then after a simple rescaling, the estimator $(\frac m{m-d}\tilde A^\top\tilde A)^{-1}$ is $(\epsilon,\delta)$-unbiased for $(A^\top A)^{-1}$ with a sketch of size $m=O(d+\sqrt d/\epsilon)$. This implies that for $m=O(d)$, the inversion bias of this estimator is $O(1/\sqrt d)$, which is much smaller than the $\Theta(1)$ approximation error implied by the subspace embedding guarantee for sub-gaussian sketches. We then propose a new sketching technique, called LEverage Score Sparsified (LESS) embeddings, which combines ideas from data-oblivious sparse embeddings and data-aware leverage-based row sampling to achieve inversion bias $\epsilon$ with sketch size $m=O(d\log d+\sqrt d/\epsilon)$ in time $O(\text{nnz}(A)\log n+md^2)$, where $\text{nnz}(A)$ is the number of non-zero entries of $A$. The key techniques enabling our analysis include an extension of a classical inequality of Bai and Silverstein for random quadratic forms, which we call the Restricted Bai-Silverstein inequality, and anti-concentration of the binomial distribution via the Paley-Zygmund inequality, which we use to prove a lower bound showing that leverage score sampling sketches generally do not achieve small inversion bias.
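To make the rescaling concrete, the following minimal NumPy sketch (an illustration under assumed parameters, not the paper's implementation) compares averages of the plain estimator $(\tilde A^\top\tilde A)^{-1}$ with the debiased estimator $(\frac m{m-d}\tilde A^\top\tilde A)^{-1}$ under a dense Gaussian sketch; the helper name `sketched_inverse_covariance` and all numerical settings are hypothetical, and LESS embeddings are not implemented here.

```python
# Illustrative sketch only: dense Gaussian sketching with the m/(m-d)
# debiasing rescaling described in the abstract. LESS embeddings and the
# (epsilon, delta)-unbiasedness analysis are not reproduced here.
import numpy as np

def sketched_inverse_covariance(A, m, rng, debias=True):
    """Estimate (A^T A)^{-1} from a single Gaussian sketch SA of size m x d."""
    n, d = A.shape
    # Dense sub-gaussian (Gaussian) sketch, scaled so that E[S^T S] = I_n.
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    As = S @ A
    scale = m / (m - d) if debias else 1.0
    return np.linalg.inv(scale * (As.T @ As))

rng = np.random.default_rng(0)
n, d, m, q = 2000, 20, 100, 400   # q independent sketches to average over
A = rng.standard_normal((n, d))
true_inv = np.linalg.inv(A.T @ A)

for debias in (False, True):
    avg = sum(sketched_inverse_covariance(A, m, rng, debias) for _ in range(q)) / q
    err = np.linalg.norm(avg - true_inv) / np.linalg.norm(true_inv)
    print(f"debias={debias}: relative bias of averaged estimate ~ {err:.3f}")
```

Averaging many independent sketches drives the variance down but leaves the bias untouched, so the unrescaled average stays a constant factor away from $(A^\top A)^{-1}$, while the $\frac m{m-d}$ rescaling brings the averaged estimate much closer to it.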
