Coordinate Descent Converges Faster with the Gauss-Southwell Rule Than Random Selection

There has been significant recent work on the theory and application of randomized coordinate descent algorithms, beginning with the work of Nesterov [SIAM J. Optim., 22(2), 2012], who showed that a random-coordinate selection rule achieves the same convergence rate as the Gauss-Southwell selection rule. This result suggests that we should never use the Gauss-Southwell rule, as it is typically much more expensive than random selection. However, the empirical behaviours of these algorithms contradict this theoretical result: in applications where the computational costs of the selection rules are comparable, the Gauss-Southwell selection rule tends to perform substantially better than random coordinate selection. We give a simple analysis of the Gauss-Southwell rule showing that---except in extreme cases---its convergence rate is faster than choosing random coordinates. Further, in this work we (i) show that exact coordinate optimization improves the convergence rate for certain sparse problems, (ii) propose a Gauss-Southwell-Lipschitz rule that gives an even faster convergence rate given knowledge of the Lipschitz constants of the partial derivatives, (iii) analyze the effect of approximate Gauss-Southwell rules, and (iv) analyze proximal-gradient variants of the Gauss-Southwell rule.
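To make the selection rules concrete, here is a minimal sketch (not the paper's implementation) of coordinate descent on a simple quadratic f(x) = (1/2)x'Ax - b'x, with the three selection rules discussed above swappable via a flag. The function name `coordinate_descent` and its parameters are illustrative assumptions; for this quadratic, the coordinate-wise Lipschitz constants are exactly the diagonal entries L_i = A_ii, so exact coordinate optimization is a single closed-form step.

```python
import numpy as np

def coordinate_descent(A, b, rule="gs", n_iters=500, seed=0):
    """Minimize f(x) = 0.5 * x^T A x - b^T x (A symmetric positive definite)
    by coordinate descent with exact coordinate optimization."""
    rng = np.random.default_rng(seed)
    n = len(b)
    L = np.diag(A)              # coordinate-wise Lipschitz constants: L_i = A_ii
    x = np.zeros(n)
    grad = A @ x - b            # gradient of f at the current x
    for _ in range(n_iters):
        if rule == "random":    # uniform random selection (Nesterov-style)
            i = int(rng.integers(n))
        elif rule == "gs":      # Gauss-Southwell: largest absolute partial derivative
            i = int(np.argmax(np.abs(grad)))
        elif rule == "gsl":     # Gauss-Southwell-Lipschitz: largest |grad_i| / sqrt(L_i)
            i = int(np.argmax(np.abs(grad) / np.sqrt(L)))
        else:
            raise ValueError(f"unknown rule: {rule}")
        step = grad[i] / L[i]   # exact minimizer along coordinate i
        x[i] -= step
        grad -= step * A[:, i]  # rank-one update keeps the full gradient in O(n)
    return x
```

On a random positive-definite instance (e.g. A = M @ M.T + np.eye(n)), tracking f(x) per iteration typically matches the behaviour the abstract describes: the "gs" and "gsl" runs reduce the objective at least as fast as "random", with "gsl" pulling ahead when the L_i vary widely across coordinates.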

References

[1] W. Ferger. The Nature and Use of the Harmonic Mean, 1931.

[2] R. L. Rivest et al. Introduction to Algorithms, 1990.

[3] Z.-Q. Luo et al. Error bounds and convergence analysis of feasible descent methods: a general approach. Ann. Oper. Res., 1993.

[4] D. P. Bertsekas. Nonlinear Programming, 1997.

[5] J. D. Lafferty et al. Inducing Features of Random Fields. IEEE Trans. Pattern Anal. Mach. Intell., 1995.

[6] G. Rätsch et al. On the Convergence of Leveraging. NIPS, 2001.

[7] B. Schölkopf et al. Learning with Local and Global Consistency. NIPS, 2003.

[8] S. S. Keerthi et al. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 2003.

[9] H. Bourbeau. Greed Is Good, 2004.

[10] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization, 2004.

[11] L. Held et al. Gaussian Markov Random Fields: Theory and Applications, 2005.

[12] S. P. Boyd et al. Convex Optimization. Cambridge University Press, 2004.

[13] A. Zien et al. Label Propagation and Quadratic Criterion, 2006.

[14] D. Koller et al. Efficient Structure Learning of Markov Networks using L1-Regularization. NIPS, 2006.

[15] C.-J. Lin et al. A dual coordinate descent method for large-scale linear SVM. ICML, 2008.

[16] K. Lange et al. Coordinate descent algorithms for lasso penalized regression. arXiv:0803.3876, 2008.

[17] C. Sminchisescu et al. Greedy Block Coordinate Descent for Large Scale Gaussian Process Regression. UAI, 2008.

[18] S. M. Omohundro. Five Balltree Construction Algorithms, 1989.

[19] S. Osher et al. Coordinate descent optimization for ℓ1 minimization with application to compressed sensing; a greedy algorithm, 2009.

[20] K. Scheinberg et al. SINCO: A Greedy Coordinate Ascent Method for Sparse Inverse Covariance Selection Problem. IBM Research Report, 2009.

[21] P. Tseng et al. A coordinate gradient descent method for nonsmooth separable minimization. Math. Program., 2008.

[22] A. S. Lewis et al. Randomized Methods for Linear Constraints: Convergence Rates and Conditioning. Math. Oper. Res., 2008.

[23] P. Ravikumar et al. Nearest Neighbor based Greedy Coordinate Descent. NIPS, 2011.

[24] Y. Nesterov. Efficiency of Coordinate Descent Methods on Huge-Scale Optimization Problems. SIAM J. Optim., 22(2), 2012.

[25] T. S. Jaakkola et al. Convergence Rate Analysis of MAP Coordinate Minimization Algorithms. NIPS, 2012.

[26] S. J. Wright. Accelerated Block-coordinate Relaxation for Regularized Optimization. SIAM J. Optim., 2012.

[27] M. W. Schmidt et al. Hybrid Deterministic-Stochastic Methods for Data Fitting. SIAM J. Sci. Comput., 2011.

[28] A. Beck et al. On the Convergence of Block Coordinate Descent Type Methods. SIAM J. Optim., 2013.

[29] S. Shalev-Shwartz et al. Stochastic dual coordinate ascent methods for regularized loss. J. Mach. Learn. Res., 2012.

[30] P. Li et al. Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS). NIPS, 2014.

[31] P. Richtárik et al. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Program., 2011.

[32] A. Uschmajew et al. On Convergence of the Maximum Block Improvement Method. SIAM J. Optim., 2015.

[33] P. Richtárik et al. Accelerated, Parallel, and Proximal Coordinate Descent. SIAM J. Optim., 2013.

[34] P. Richtárik et al. Parallel coordinate descent methods for big data optimization. Math. Program., 2012.

[35] Y. Jiang et al. Accelerated Stochastic Greedy Coordinate Descent by Soft Thresholding Projection onto Simplex. NIPS, 2017.

[36] M. W. Schmidt et al. Let's Make Block Coordinate Descent Go Fast: Faster Greedy Rules, Message-Passing, Active-Set Complexity, and Superlinear Convergence, 2017.