Manifold Identification for Ultimately Communication-Efficient Distributed Optimization

This work proposes a progressive manifold identification approach for distributed optimization, with sound theoretical justifications, that greatly reduces both the number of communication rounds and the bytes communicated per round for partly smooth regularized problems such as ℓ1- and group-LASSO-regularized ones. Our two-stage method first uses an inexact proximal quasi-Newton method to iteratively identify a sequence of low-dimensional manifolds in which the final solution should lie, and restricts the model update to the current manifold, gradually lowering the order of the per-round communication cost from the problem dimension to the dimension of the manifold that contains a solution and within which the problem is smooth. After this manifold is identified, we take superlinearly convergent truncated semismooth Newton steps, computed by preconditioned conjugate gradient, to further reduce the number of communication rounds by improving the convergence rate from the existing linear or sublinear rates to a superlinear one. Experiments show that our method can be two orders of magnitude better in communication cost and an order of magnitude faster in running time than the state of the art.
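The abstract describes the two-stage scheme only at a high level. The following single-machine Python/NumPy sketch illustrates the core idea on an ℓ1-regularized least-squares problem: a proximal first-order stage that monitors the support (the identified manifold) of the iterate, followed by a Newton-CG stage on the smooth problem restricted to that support. The proximal-gradient stage, the support-stability heuristic, and all names here are illustrative assumptions standing in for the paper's inexact proximal quasi-Newton and distributed semismooth Newton machinery, not the authors' implementation.

```python
# Minimal sketch of a two-stage "identify, then Newton" scheme for
#   min_x 0.5*||Ax - b||^2 + lam*||x||_1.
# Stage 1: proximal gradient (stand-in for an inexact proximal quasi-Newton
#          method) with support/manifold identification.
# Stage 2: Newton-CG on the smooth problem restricted to the identified support.
# Single-machine and illustrative only; the distributed aspects are omitted.
import numpy as np
from scipy.sparse.linalg import cg, LinearOperator


def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def two_stage_l1_ls(A, b, lam, prox_iters=200, newton_iters=20, tol=1e-8):
    m, n = A.shape
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(n)
    support = np.zeros(n, dtype=bool)

    # ---- Stage 1: proximal gradient with manifold (support) identification ----
    stable_rounds = 0
    for _ in range(prox_iters):
        grad = A.T @ (A @ x - b)
        x = soft_threshold(x - grad / L, lam / L)
        new_support = x != 0
        stable_rounds = stable_rounds + 1 if np.array_equal(new_support, support) else 0
        support = new_support
        if stable_rounds >= 5:             # heuristic: support unchanged for 5 steps
            break

    # ---- Stage 2: Newton-CG on the smooth restricted problem ----
    # On the manifold {x : x_i = 0 off the support, signs fixed on the support},
    # the objective reduces to the smooth function 0.5*||A_S x_S - b||^2 + lam * s^T x_S.
    S = np.flatnonzero(support)
    if S.size == 0:
        return x
    A_S, s = A[:, S], np.sign(x[S])
    x_S = x[S]
    for _ in range(newton_iters):
        grad_S = A_S.T @ (A_S @ x_S - b) + lam * s
        if np.linalg.norm(grad_S) < tol:
            break
        H = LinearOperator((S.size, S.size),
                           matvec=lambda v: A_S.T @ (A_S @ v))  # Hessian-vector product
        d, _ = cg(H, -grad_S, maxiter=50)   # truncated CG; preconditioning omitted
        x_S = x_S + d                       # full step; a line search is advisable in general
    x_out = np.zeros(n)
    x_out[S] = x_S
    return x_out
```

Because only the coordinates in the identified support are updated in Stage 2, a distributed implementation would only need to communicate vectors of the manifold's dimension per round, which is the source of the per-round savings the abstract describes.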
