Let's Make Block Coordinate Descent Go Fast: Faster Greedy Rules, Message-Passing, Active-Set Complexity, and Superlinear Convergence

Block coordinate descent (BCD) methods are widely used for large-scale numerical optimization because of their cheap iteration costs, low memory requirements, amenability to parallelization, and ability to exploit problem structure. Three main algorithmic choices influence the performance of BCD methods: the block partitioning strategy, the block selection rule, and the block update rule. In this paper we explore all three of these building blocks and propose variations for each that can lead to significantly faster BCD methods. We (i) propose new greedy block-selection strategies that guarantee more progress per iteration than the Gauss-Southwell rule; (ii) explore practical issues such as how to implement the new rules when using "variable" blocks; (iii) explore the use of message-passing to compute matrix or Newton updates efficiently on huge blocks for problems with a sparse dependency between variables; and (iv) consider optimal active manifold identification, which leads to bounds on the "active-set complexity" of BCD methods and to superlinear convergence for certain problems with sparse solutions (and in some cases finite termination at an optimal solution). We support all of our findings with numerical results for the classic machine learning problems of least squares, logistic regression, multi-class logistic regression, label propagation, and L1-regularization.
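To make the setting concrete, below is a minimal NumPy sketch (not the authors' implementation) of greedy block coordinate descent on a least-squares objective. It uses a basic Gauss-Southwell-style rule that selects the block whose partial gradient has the largest norm and then performs an exact ("matrix") update on that block; the function name, block partition, and data are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code): greedy BCD for
# f(x) = 0.5 * ||A x - b||^2 with a Gauss-Southwell-style block rule.
import numpy as np

def greedy_bcd_least_squares(A, b, blocks, n_iters=100):
    """blocks: list of index arrays partitioning the coordinates of x."""
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ x - b)  # full gradient of the quadratic objective
        # Greedy rule: pick the block with the largest partial-gradient norm.
        scores = [np.linalg.norm(grad[blk]) for blk in blocks]
        blk = blocks[int(np.argmax(scores))]
        # Exact ("matrix") update on the chosen block: solve the block
        # subproblem min_d 0.5 * d^T (A_b^T A_b) d + grad_b^T d.
        A_b = A[:, blk]
        d = np.linalg.solve(A_b.T @ A_b, -grad[blk])
        x[blk] += d
    return x

# Toy usage with contiguous blocks of size 5 on random data.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((200, 50)), rng.standard_normal(200)
blocks = [np.arange(i, i + 5) for i in range(0, 50, 5)]
x_hat = greedy_bcd_least_squares(A, b, blocks)
```

For nonsmooth problems such as L1-regularized objectives, the gradient-norm score and the exact block solve would be replaced by proximal analogues; the rules studied in the paper refine this basic greedy scheme.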
