Breaking Locality Accelerates Block Gauss-Seidel

Recent work by Nesterov and Stich showed that momentum can be used to accelerate the rate of convergence for block Gauss-Seidel in the setting where a fixed partitioning of the coordinates is chosen ahead of time. We show that this setting is too restrictive, constructing instances where breaking locality by running non-accelerated Gauss-Seidel with randomly sampled coordinates substantially outperforms accelerated Gauss-Seidel with any fixed partitioning. Motivated by this finding, we analyze the accelerated block Gauss-Seidel algorithm in the random coordinate sampling setting. Our analysis captures the benefit of acceleration with a new data-dependent parameter which is well behaved when the matrix sub-blocks are well-conditioned. Empirically, we show that accelerated Gauss-Seidel with random coordinate sampling provides speedups for large-scale machine learning tasks when compared to non-accelerated Gauss-Seidel and the classical conjugate-gradient algorithm.
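
To make the update concrete, the following is a minimal NumPy sketch of the non-accelerated randomized block Gauss-Seidel iteration in the random coordinate sampling setting; the accelerated variant analyzed in the paper additionally maintains Nesterov-style momentum sequences, which are omitted here. The function name, block size, and iteration count are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def block_gauss_seidel(A, b, block_size, num_iters, rng=None):
    """Randomized block Gauss-Seidel for A x = b, with A symmetric
    positive definite. Each iteration samples `block_size` coordinates
    uniformly at random and solves the induced sub-system exactly."""
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[0]
    x = np.zeros(n)
    for _ in range(num_iters):
        # Random coordinate sampling: a fresh block every iteration,
        # rather than a fixed partition chosen ahead of time.
        S = rng.choice(n, size=block_size, replace=False)
        # Exact block update: x_S <- x_S + A[S, S]^{-1} (b_S - (A x)_S).
        residual_S = b[S] - A[S, :] @ x
        x[S] += np.linalg.solve(A[np.ix_(S, S)], residual_S)
    return x

# Usage on a synthetic well-conditioned positive definite system.
rng = np.random.default_rng(0)
M = rng.standard_normal((100, 100))
A = M @ M.T + np.eye(100)
b = rng.standard_normal(100)
x = block_gauss_seidel(A, b, block_size=10, num_iters=2000, rng=rng)
print(np.linalg.norm(A @ x - b))  # residual norm; decreases with more iterations
```

Each iteration inverts only the sampled sub-block A[S, S], which is why the analysis is governed by how well-conditioned those sub-blocks are.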

[1] C. T. Kelley. Iterative Methods for Linear and Nonlinear Equations, 1995.

[2] Christopher Ré, et al. Parallel stochastic gradient algorithms for large-scale matrix completion, 2013, Math. Program. Comput.

[3] Peter Richtárik, et al. Randomized Iterative Methods for Linear Systems, 2015, SIAM J. Matrix Anal. Appl.

[4] Peter Richtárik, et al. Accelerated, Parallel, and Proximal Coordinate Descent, 2013, SIAM J. Optim.

[5] Peter Richtárik, et al. Coordinate descent with arbitrary sampling II: expected separable overapproximation, 2014, Optim. Methods Softw.

[6] Michael I. Jordan, et al. A Lyapunov Analysis of Momentum Methods in Optimization, 2016, ArXiv.

[7] Stephen J. Wright, et al. Random permutations fix a worst case for cyclic coordinate descent, 2016, IMA Journal of Numerical Analysis.

[8] Zeyuan Allen-Zhu, et al. Even Faster Accelerated Coordinate Descent Using Non-Uniform Sampling, 2015, ICML.

[9] R. Vershynin, et al. A Randomized Kaczmarz Algorithm with Exponential Convergence, 2007, math/0702226.

[10] Stephen J. Wright. Coordinate descent algorithms, 2015, Mathematical Programming.

[11] Peter Richtárik, et al. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function, 2011, Mathematical Programming.

[12] Paul Tseng, et al. A coordinate gradient descent method for nonsmooth separable minimization, 2008, Math. Program.

[13] Lorenzo Rosasco, et al. FALKON: An Optimal Large Scale Kernel Method, 2017, NIPS.

[14] Stephen J. Wright, et al. An accelerated randomized Kaczmarz algorithm, 2013, Math. Comput.

[15] Kimon Fountoulakis, et al. A flexible coordinate descent method, 2015, Comput. Optim. Appl.

[16] Peter Richtárik, et al. Randomized Dual Coordinate Ascent with Arbitrary Sampling, 2014, ArXiv.

[17] Yin Tat Lee, et al. Efficient Accelerated Coordinate Descent Methods and Faster Algorithms for Solving Linear Systems, 2013, IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS).

[18] Peter Richtárik, et al. Coordinate descent with arbitrary sampling I: algorithms and complexity, 2014, Optim. Methods Softw.

[19] David P. Woodruff, et al. Faster Kernel Ridge Regression Using Sketching and Preconditioning, 2016, SIAM J. Matrix Anal. Appl.

[20] Lin Xiao, et al. An Accelerated Proximal Coordinate Gradient Method, 2014, NIPS.

[21] Benjamin Recht, et al. Large Scale Kernel Learning using Block Coordinate Descent, 2016, ArXiv.

[22] Mark W. Schmidt, et al. Convergence Rates for Greedy Kaczmarz Algorithms, and Randomized Kaczmarz Rules Using the Orthogonality Graph, 2016, UAI.

[23] Deanna Needell, et al. Paved with Good Intentions: Analysis of a Randomized Block Kaczmarz Method, 2012, ArXiv.

[24] Yurii Nesterov, et al. Efficiency of the Accelerated Coordinate Descent Method on Structured Optimization Problems, 2017, SIAM J. Optim.

[25] Lin Xiao, et al. On the complexity analysis of randomized block-coordinate descent methods, 2013, Mathematical Programming.

[26] Peter Richtárik, et al. SDNA: Stochastic Dual Newton Ascent for Empirical Risk Minimization, 2015, ICML.

[27] Adrian S. Lewis, et al. Randomized Methods for Linear Constraints: Convergence Rates and Conditioning, 2008, Math. Oper. Res.

[28] Mark W. Schmidt, et al. Coordinate Descent Converges Faster with the Gauss-Southwell Rule Than Random Selection, 2015, ICML.

[29] Bernhard Schölkopf, et al. Learning with Kernels, 2001.

[30] Andrew Y. Ng, et al. Learning Feature Representations with K-Means, 2012, Neural Networks: Tricks of the Trade.

[31] Yurii Nesterov. Efficiency of Coordinate Descent Methods on Huge-Scale Optimization Problems, 2012, SIAM J. Optim.

[32] Benjamin Recht, et al. KeystoneML: Optimizing Pipelines for Large-Scale Advanced Analytics, 2016, IEEE 33rd International Conference on Data Engineering (ICDE).

[33] Benjamin Recht, et al. Random Features for Large-Scale Kernel Machines, 2007, NIPS.