Accelerated Coordinate Descent with Arbitrary Sampling and Best Rates for Minibatches

Accelerated coordinate descent is a widely popular optimization algorithm due to its efficiency on large-dimensional problems. It achieves state-of-the-art complexity on an important class of empirical risk minimization problems. In this paper we design and analyze an accelerated coordinate descent (ACD) method which in each iteration updates a random subset of coordinates according to an arbitrary but fixed probability law, which is a parameter of the method. If all coordinates are updated in each iteration, our method reduces to the classical accelerated gradient descent (AGD) method of Nesterov. If a single coordinate is updated in each iteration, and we pick probabilities proportional to the square roots of the coordinate-wise Lipschitz constants, our method reduces to the currently fastest coordinate descent method, NUACDM, of Allen-Zhu, Qu, Richtárik and Yuan. While minibatch variants of ACD are more popular and relevant in practice, there is no importance sampling for ACD that outperforms the standard uniform minibatch sampling. Through insights enabled by our general analysis, we design a new importance sampling for minibatch ACD which significantly outperforms the previous state-of-the-art minibatch ACD in practice. We prove a rate that is at most ${\cal O}(\sqrt{\tau})$ times worse than the rate of minibatch ACD with uniform sampling, but can be ${\cal O}(n/\tau)$ times better, where $\tau$ is the minibatch size. Since in modern supervised learning training systems it is standard practice to choose $\tau \ll n$, and often $\tau={\cal O}(1)$, our method can lead to dramatic speedups. Lastly, we obtain similar results for minibatch nonaccelerated CD as well, achieving improvements on the previous best rates.
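To make the sampling law concrete, the sketch below illustrates importance sampling with probabilities proportional to the square roots of the coordinate-wise Lipschitz constants on plain, nonaccelerated randomized coordinate descent applied to a quadratic. This is an illustration only, not the ACD method of the paper; the quadratic objective, the matrix A, the vector b, and all other names are assumptions made for the example.

```python
# Minimal sketch (assumed setup, not the paper's ACD method): randomized
# coordinate descent on f(x) = 0.5 * x^T A x - b^T x, where the coordinate-wise
# Lipschitz constants are L_i = A[i, i]. The law p_i ∝ sqrt(L_i) mirrors the
# single-coordinate sampling mentioned in the abstract.
import numpy as np

rng = np.random.default_rng(0)
n = 200
M = rng.standard_normal((n, n)) / np.sqrt(n)
A = M.T @ M + 0.1 * np.eye(n)           # positive definite Hessian (assumed)
b = rng.standard_normal(n)
L = np.diag(A).copy()                   # coordinate-wise Lipschitz constants

def cd(probs, iters=20000):
    """Randomized CD: sample coordinate i ~ probs, take a 1/L_i step along it."""
    x = np.zeros(n)
    for _ in range(iters):
        i = rng.choice(n, p=probs)
        g_i = A[i] @ x - b[i]           # i-th partial derivative of f
        x[i] -= g_i / L[i]
    return 0.5 * x @ A @ x - b @ x      # final objective value

p_uniform = np.full(n, 1.0 / n)
p_sqrt = np.sqrt(L) / np.sqrt(L).sum()  # importance sampling p_i ∝ sqrt(L_i)
print("uniform sampling :", cd(p_uniform))
print("sqrt(L) sampling :", cd(p_sqrt))
```

Running both samplers for the same number of iterations typically shows the sqrt(L)-proportional law reaching a lower objective when the diagonal of A is ill-conditioned; this toy comparison only illustrates the role of the sampling distribution, not the accelerated rates proved in the paper.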

[1] Zeyuan Allen-Zhu, et al. Even Faster Accelerated Coordinate Descent Using Non-Uniform Sampling, 2015, ICML.

[2] Ambuj Tewari, et al. On the Nonasymptotic Convergence of Cyclic Coordinate Descent Methods, 2013, SIAM J. Optim.

[3] Peter Richtárik, et al. On optimal probabilities in stochastic coordinate descent methods, 2013, Optim. Lett.

[4] Peter Richtárik, et al. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function, 2011, Mathematical Programming.

[5] Stephen J. Wright. Coordinate descent algorithms, 2015, Mathematical Programming.

[6] Peter Richtárik, et al. Stochastic Dual Coordinate Ascent with Adaptive Probabilities, 2015, ICML.

[7] Peter Richtárik, et al. Distributed Coordinate Descent Method for Learning with Big Data, 2013, J. Mach. Learn. Res.

[8] Yurii Nesterov, et al. Efficiency of Coordinate Descent Methods on Huge-Scale Optimization Problems, 2012, SIAM J. Optim.

[9] Peter Richtárik, et al. Coordinate descent with arbitrary sampling II: expected separable overapproximation, 2014, Optim. Methods Softw.

[10] P. Tseng. Convergence of a Block Coordinate Descent Method for Nondifferentiable Minimization, 2001.

[11] Shai Shalev-Shwartz, et al. Stochastic dual coordinate ascent methods for regularized loss, 2012, J. Mach. Learn. Res.

[12] Lin Xiao, et al. An Accelerated Proximal Coordinate Gradient Method, 2014, NIPS.

[13] Zeyuan Allen-Zhu, et al. Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent, 2014, ITCS.

[14] Mark W. Schmidt, et al. Coordinate Descent Converges Faster with the Gauss-Southwell Rule Than Random Selection, 2015, ICML.

[15] Chih-Jen Lin, et al. LIBSVM: A library for support vector machines, 2011, TIST.

[16] Asuman E. Ozdaglar, et al. When Cyclic Coordinate Descent Outperforms Randomized Coordinate Descent, 2017, NIPS.

[17] Fuzhen Zhang. Matrix Theory: Basic Results and Techniques, 1999.

[18] Martin Jaggi, et al. Approximate Steepest Coordinate Descent, 2017, ICML.

[19] Peter Richtárik, et al. Quartz: Randomized Dual Coordinate Ascent with Arbitrary Sampling, 2015, NIPS.

[20] Peter Richtárik, et al. Parallel coordinate descent methods for big data optimization, 2012, Mathematical Programming.

[21] Y. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2), 1983.

[22] Tong Zhang, et al. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization, 2013, Mathematical Programming.

[23] Tong Zhang, et al. Stochastic Optimization with Importance Sampling for Regularized Loss Minimization, 2014, ICML.

[24] P. Tseng, et al. On the convergence of the coordinate descent method for convex differentiable minimization, 1992.

[25] Martin Jaggi, et al. Safe Adaptive Importance Sampling, 2017, NIPS.

[26] Peter Richtárik, et al. Coordinate descent with arbitrary sampling I: algorithms and complexity, 2014, Optim. Methods Softw.

[27] Ambuj Tewari, et al. Stochastic methods for l1 regularized loss minimization, 2009, ICML.

[28] Yin Tat Lee, et al. Efficient Accelerated Coordinate Descent Methods and Faster Algorithms for Solving Linear Systems, 2013, IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS).

[29] Yurii Nesterov, et al. Introductory Lectures on Convex Optimization: A Basic Course, 2014, Applied Optimization.

[30] Peter Richtárik, et al. Importance Sampling for Minibatches, 2016, J. Mach. Learn. Res.

[31] Peter Richtárik, et al. Accelerated, Parallel, and Proximal Coordinate Descent, 2013, SIAM J. Optim.

[32] Antonin Chambolle, et al. Stochastic Primal-Dual Hybrid Gradient Algorithm with Arbitrary Sampling and Imaging Applications, 2017, SIAM J. Optim.

[33] James Demmel, et al. Asynchronous Parallel Greedy Coordinate Descent, 2016, NIPS.