Efficient Distributed Online Prediction and Stochastic Optimization With Approximate Distributed Averaging

We study distributed methods for online prediction and stochastic optimization. Our approach is iterative: In each round, nodes first perform local computations and then communicate in order to aggregate information and synchronize their decision variables. Synchronization is accomplished through the use of a distributed averaging protocol. When an exact distributed averaging protocol is used, it is known that the optimal regret bound of O(√(m)) can be achieved using the distributed minibatch algorithm in [“Optimal distributed online prediction using minibatches,” J. Mach. Learn. Res., vol. 13, pp. 165-202, 2012], where m is the total number of samples processed across the network. We focus on methods using approximate distributed averaging protocols and show that the optimal regret bound can also be achieved in this setting. In particular, we propose a gossip-based optimization method that achieves the optimal regret bound. The amount of communication required depends on the network topology through the second largest eigenvalue of the transition matrix of a random walk on the network. In the setting of stochastic optimization, the proposed gossip-based approach achieves nearly linear scaling: The optimization error is guaranteed to be no more than ∈ after O(1/nϵ2) rounds, each of which involves O(log n) gossip iterations, when n nodes communicate over a well-connected graph.

[1]  Michael G. Rabbat,et al.  Communication/Computation Tradeoffs in Consensus-Based Distributed Optimization , 2012, NIPS.

[2]  R. Rockafellar,et al.  On the interchange of subdifferentiation and conditional expectation for convex functionals , 1982 .

[3]  Asuman E. Ozdaglar,et al.  Distributed Subgradient Methods for Multi-Agent Optimization , 2009, IEEE Transactions on Automatic Control.

[4]  Soummya Kar,et al.  Gossip Algorithms for Distributed Signal Processing , 2010, Proceedings of the IEEE.

[5]  Gaston H. Gonnet,et al.  On the LambertW function , 1996, Adv. Comput. Math..

[6]  Feng Yan,et al.  Distributed Autonomous Online Learning: Regrets and Intrinsic Privacy-Preserving Properties , 2010, IEEE Transactions on Knowledge and Data Engineering.

[7]  Ohad Shamir,et al.  Optimal Distributed Online Prediction Using Mini-Batches , 2010, J. Mach. Learn. Res..

[8]  Johannes Gehrke,et al.  Gossip-based computation of aggregate information , 2003, 44th Annual IEEE Symposium on Foundations of Computer Science, 2003. Proceedings..

[9]  Stephen P. Boyd,et al.  Fast linear iterations for distributed averaging , 2003, 42nd IEEE International Conference on Decision and Control (IEEE Cat. No.03CH37475).

[10]  Ali H. Sayed,et al.  Excess-Risk of Distributed Stochastic Learners , 2013, IEEE Transactions on Information Theory.

[11]  Martin J. Wainwright,et al.  Dual Averaging for Distributed Optimization: Convergence Analysis and Network Scaling , 2010, IEEE Transactions on Automatic Control.

[12]  John Darzentas,et al.  Problem Complexity and Method Efficiency in Optimization , 1983 .

[13]  Stephen P. Boyd,et al.  Randomized gossip algorithms , 2006, IEEE Transactions on Information Theory.

[14]  Angelia Nedic,et al.  Stochastic Gradient-Push for Strongly Convex Functions on Time-Varying Directed Graphs , 2014, IEEE Transactions on Automatic Control.

[15]  Michael I. Jordan,et al.  Adding vs. Averaging in Distributed Primal-Dual Optimization , 2015, ICML.

[16]  Shai Shalev-Shwartz,et al.  Online Learning and Online Convex Optimization , 2012, Found. Trends Mach. Learn..

[17]  John C. Duchi,et al.  Distributed delayed stochastic optimization , 2011, 2012 IEEE 51st IEEE Conference on Decision and Control (CDC).

[18]  U. Feige,et al.  Spectral Graph Theory , 2015 .

[19]  Yurii Nesterov,et al.  Primal-dual subgradient methods for convex problems , 2005, Math. Program..

[20]  Yurii Nesterov,et al.  Introductory Lectures on Convex Optimization - A Basic Course , 2014, Applied Optimization.

[21]  Léon Bottou,et al.  Large-Scale Machine Learning with Stochastic Gradient Descent , 2010, COMPSTAT.

[22]  John N. Tsitsiklis,et al.  Distributed Asynchronous Deterministic and Stochastic Gradient Optimization Algorithms , 1984, 1984 American Control Conference.

[23]  Lin Xiao,et al.  Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization , 2009, J. Mach. Learn. Res..

[24]  Jack Dongarra,et al.  MPI - The Complete Reference: Volume 1, The MPI Core , 1998 .

[25]  William Gropp,et al.  Mpi---the complete reference: volume 1 , 1998 .

[26]  Michael G. Rabbat,et al.  Multi-agent mirror descent for decentralized stochastic optimization , 2015, 2015 IEEE 6th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP).

[27]  M. Raginsky,et al.  Decentralized Online Convex Programming with local information , 2011, Proceedings of the 2011 American Control Conference.

[28]  Michael G. Rabbat,et al.  Consensus-based distributed optimization: Practical issues and applications in large-scale machine learning , 2012, 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton).

[29]  Michael G. Rabbat,et al.  Push-Sum Distributed Dual Averaging for convex optimization , 2012, 2012 IEEE 51st IEEE Conference on Decision and Control (CDC).

[30]  Angelia Nedic,et al.  Distributed Stochastic Subgradient Projection Algorithms for Convex Optimization , 2008, J. Optim. Theory Appl..

[31]  Márk Jelasity,et al.  Gossip-based aggregation in large dynamic networks , 2005, TOCS.

[32]  Reza Olfati-Saber,et al.  Consensus and Cooperation in Networked Multi-Agent Systems , 2007, Proceedings of the IEEE.

[33]  Christoforos N. Hadjicostis,et al.  Distributed Finite-Time Computation of Digraph Parameters: Left-Eigenvector, Out-Degree and Spectrum , 2016, IEEE Transactions on Control of Network Systems.

[34]  Gábor Lugosi,et al.  Prediction, learning, and games , 2006 .

[35]  Thomas Hofmann,et al.  Communication-Efficient Distributed Dual Coordinate Ascent , 2014, NIPS.

[36]  Guanghui Lan,et al.  An optimal method for stochastic composite optimization , 2011, Mathematical Programming.

[37]  E. Seneta Non-negative Matrices and Markov Chains , 2008 .

[38]  Ali H. Sayed,et al.  Diffusion Adaptation Strategies for Distributed Optimization and Learning Over Networks , 2011, IEEE Transactions on Signal Processing.

[39]  Peter Richtárik,et al.  Fast distributed coordinate descent for non-strongly convex losses , 2014, 2014 IEEE International Workshop on Machine Learning for Signal Processing (MLSP).

[40]  Ali H. Sayed,et al.  Adaptive Penalty-Based Distributed Stochastic Convex Optimization , 2013, IEEE Transactions on Signal Processing.

[41]  José M. F. Moura,et al.  Fast Distributed Gradient Methods , 2011, IEEE Transactions on Automatic Control.

[42]  Charles R. Johnson,et al.  Matrix analysis , 1985, Statistical Inference for Engineers and Data Scientists.

[43]  Y. Singer,et al.  Logarithmic Regret Algorithms for Strongly Convex Repeated Games , 2007 .

[44]  John N. Tsitsiklis,et al.  Convergence Speed in Distributed Consensus and Averaging , 2009, SIAM J. Control. Optim..

[45]  Avi Wigderson,et al.  Entropy waves, the zig-zag graph product, and new constant-degree expanders and extractors , 2000, Proceedings 41st Annual Symposium on Foundations of Computer Science.

[46]  Michael G. Rabbat,et al.  Optimization and Analysis of Distributed Averaging With Short Node Memory , 2009, IEEE Transactions on Signal Processing.