CoCoA: A General Framework for Communication-Efficient Distributed Optimization

The scale of modern datasets necessitates the development of efficient distributed optimization methods for machine learning. We present CoCoA, a general-purpose framework for distributed computing environments that has an efficient communication scheme and is applicable to a wide variety of problems in machine learning and signal processing. We extend the framework to cover general non-strongly-convex regularizers, including L1-regularized problems such as the lasso, sparse logistic regression, and elastic net regularization, and show how earlier work can be derived as a special case. We provide convergence guarantees for the class of convex regularized loss minimization objectives, leveraging a novel approach to handling non-strongly-convex regularizers and non-smooth loss functions. The resulting framework has markedly improved performance over state-of-the-art methods, as we illustrate with an extensive set of experiments on real distributed datasets.
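
To make the problem class concrete, the following is a sketch of the convex regularized loss minimization objective the abstract refers to; the notation (data points x_i, convex losses \ell_i, regularizer g, weights \lambda, \eta) is illustrative and not fixed by the abstract:

\min_{w \in \mathbb{R}^d} \;\; \frac{1}{n} \sum_{i=1}^{n} \ell_i(x_i^\top w) \; + \; \lambda\, g(w).

Here g need not be strongly convex: taking g(w) = \|w\|_1 with the squared loss recovers the lasso, the same regularizer with the logistic loss gives sparse logistic regression, and g(w) = \|w\|_1 + \tfrac{\eta}{2}\|w\|_2^2 gives elastic net regularization.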
