Debiasing Distributed Second Order Optimization with Surrogate Sketching and Scaled Regularization

In distributed second order optimization, a standard strategy is to average many local estimates, each of which is based on a small sketch or batch of the data. However, the local estimates on each machine are typically biased, relative to the full solution on all of the data, and this can limit the effectiveness of averaging. Here, we introduce a new technique for debiasing the local estimates, which leads to both theoretical and empirical improvements in the convergence rate of distributed second order methods. Our technique has two novel components: (1) modifying standard sketching techniques to obtain what we call a surrogate sketch; and (2) carefully scaling the global regularization parameter for local computations. Our surrogate sketches are based on determinantal point processes, a family of distributions for which the bias of an estimate of the inverse Hessian can be computed exactly. Based on this computation, we show that when the objective being minimized is $\ell_2$-regularized with parameter $\lambda$ and individual machines are each given a sketch of size $m$, then to eliminate the bias, local estimates should be computed using a shrunk regularization parameter given by $\lambda^{\prime}=\lambda\cdot\left(1-\frac{d_{\lambda}}{m}\right)$, where $d_{\lambda}$ is the $\lambda$-effective dimension of the Hessian (or, for quadratic problems, the data matrix).
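For a quadratic (least-squares) objective with data matrix $A$, the $\lambda$-effective dimension is $d_{\lambda}=\mathrm{tr}\big(A^{\top}A\,(A^{\top}A+\lambda I)^{-1}\big)$, so the debiasing prescription above amounts to a one-line rescaling of the regularizer used by each machine. The sketch below (NumPy; the function names and the SVD-based computation of $d_{\lambda}$ are our own illustration, not code from the paper) shows how the shrunk local parameter $\lambda^{\prime}$ could be computed, assuming the data matrix fits in memory.

```python
import numpy as np

def effective_dimension(A, lam):
    """lambda-effective dimension d_lambda = tr(A^T A (A^T A + lam*I)^{-1}),
    computed from the singular values of A (illustrative; assumes A fits in memory)."""
    s = np.linalg.svd(A, compute_uv=False)
    return float(np.sum(s**2 / (s**2 + lam)))

def scaled_regularization(A, lam, m):
    """Shrunk local regularization lambda' = lam * (1 - d_lambda / m)
    for surrogate sketches of size m (requires m > d_lambda)."""
    d_lam = effective_dimension(A, lam)
    assert m > d_lam, "sketch size must exceed the effective dimension"
    return lam * (1.0 - d_lam / m)

# Hypothetical usage: n x d data matrix, global regularizer lam, sketch size m per machine.
rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 50))
lam, m = 1.0, 200
print(scaled_regularization(A, lam, m))  # each machine solves its local problem with this lambda'
```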
