Even Faster Accelerated Coordinate Descent Using Non-Uniform Sampling

Accelerated coordinate descent is widely used in optimization due to its cheap per-iteration cost and its scalability to large-scale problems. Up to a primal-dual transformation, it is also the same as accelerated stochastic gradient descent, which is one of the central methods used in machine learning. In this paper, we improve the best known running time of accelerated coordinate descent by a factor of up to $\sqrt{n}$. Our improvement is based on a clean, novel non-uniform sampling scheme that selects each coordinate with probability proportional to the square root of its smoothness parameter. Our proof technique also deviates from the classical estimation-sequence technique used in prior work. Our speed-up applies to important problems such as empirical risk minimization and solving linear systems, both in theory and in practice.
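
To make the sampling rule concrete, the following is a minimal, illustrative sketch (not the paper's accelerated algorithm): plain randomized coordinate descent on a convex quadratic $f(x) = \tfrac{1}{2}x^\top A x - b^\top x$, where the coordinate smoothness parameters are $L_i = A_{ii}$ and coordinate $i$ is drawn with probability proportional to $\sqrt{L_i}$ rather than uniformly. The function name and all parameters below are illustrative assumptions, not code from the paper.

```python
import numpy as np

# Illustrative sketch only: non-accelerated randomized coordinate descent on a
# convex quadratic f(x) = 1/2 x^T A x - b^T x, with coordinate smoothness
# parameters L_i = A_ii. The point being demonstrated is the sampling rule
# p_i proportional to sqrt(L_i) from the abstract, not the full accelerated method.

def coord_descent_sqrt_sampling(A, b, iters=5000, seed=0):
    rng = np.random.default_rng(seed)
    n = b.shape[0]
    L = np.diag(A)                       # per-coordinate smoothness parameters L_i
    p = np.sqrt(L) / np.sqrt(L).sum()    # sampling distribution p_i ∝ sqrt(L_i)
    x = np.zeros(n)
    for _ in range(iters):
        i = rng.choice(n, p=p)           # draw a coordinate non-uniformly
        g_i = A[i] @ x - b[i]            # i-th partial derivative of f at x
        x[i] -= g_i / L[i]               # coordinate step of size 1/L_i
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    M = rng.standard_normal((80, 40))
    A = M.T @ M + 0.1 * np.eye(40)       # symmetric positive definite system
    b = rng.standard_normal(40)
    x = coord_descent_sqrt_sampling(A, b)
    print("residual norm:", np.linalg.norm(A @ x - b))
```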
