Adding vs. Averaging in Distributed Primal-Dual Optimization

Distributed optimization methods for large-scale machine learning suffer from a communication bottleneck. It is difficult to reduce this bottleneck while still efficiently and accurately aggregating partial work from different machines. In this paper, we present a novel generalization of the recent communication-efficient primal-dual framework (COCOA) for distributed optimization. Our framework, COCOA+, allows for additive combination of local updates to the global parameters at each iteration, whereas previous schemes with convergence guarantees only allow conservative averaging. We give stronger (primal-dual) convergence rate guarantees for both COCOA and our new variants, and generalize the theory for both methods to cover non-smooth convex loss functions. We provide an extensive experimental comparison that shows the markedly improved performance of COCOA+ on several real-world distributed datasets, especially when scaling up the number of machines.
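To make the distinction concrete, the following is a minimal sketch (not the authors' implementation) of the aggregation step that separates the two schemes: each of the K machines returns a local update Delta w_k in one communication round, and the driver either sums the updates (COCOA+) or averages them (the conservative COCOA combination). The function name `aggregate` and the toy data are illustrative assumptions; the local solver is abstracted away entirely, and in the paper safe adding additionally requires the local subproblems to be made more conservative via an aggregation parameter.

```python
# Minimal sketch (assumed interface, not the paper's code) of adding vs.
# averaging per-machine updates in one communication round.
import numpy as np

def aggregate(w, deltas, mode="add"):
    """Combine the per-machine updates Delta w_k into the new global iterate.

    mode="add"     : updates are summed (gamma = 1), as in COCOA+;
                     the paper pairs this with more conservative local
                     subproblems so that summation remains safe.
    mode="average" : updates are averaged (gamma = 1/K), the conservative
                     combination used by the original COCOA scheme.
    """
    K = len(deltas)
    if mode == "add":
        gamma = 1.0
    elif mode == "average":
        gamma = 1.0 / K
    else:
        raise ValueError("mode must be 'add' or 'average'")
    return w + gamma * np.sum(deltas, axis=0)

# Toy usage: K = 4 machines each propose an update to a 5-dimensional model.
rng = np.random.default_rng(0)
w = np.zeros(5)
deltas = [rng.normal(scale=0.1, size=5) for _ in range(4)]
w_add = aggregate(w, deltas, mode="add")      # additive combination
w_avg = aggregate(w, deltas, mode="average")  # conservative averaging
```

The only difference between the two modes is the scaling factor applied to the summed local updates; the paper's contribution is showing that the additive choice, combined with appropriately defined local subproblems, retains convergence guarantees while aggregating far more of the local work per round.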
