Data-Dependent Convergence for Consensus Stochastic Optimization

We study a distributed consensus-based stochastic gradient descent (SGD) algorithm and show that the rate of convergence involves the spectral properties of two matrices: The standard spectral gap of a weight matrix from the network topology and a new term depending on the spectral norm of the sample covariance matrix of the data. This data-dependent convergence rate shows that distributed SGD algorithms perform better on datasets with small spectral norm. Our analysis method also allows us to find data-dependent convergence rates as we limit the amount of communication. Spreading a fixed amount of data across more nodes slows convergence; for asymptotically growing datasets, we show that adding more machines can help when minimizing twice-differentiable losses.

[1]  Mark W. Schmidt,et al.  Minimizing finite sums with the stochastic average gradient , 2013, Mathematical Programming.

[2]  Shai Ben-David,et al.  Understanding Machine Learning: From Theory to Algorithms , 2014 .

[3]  Michael G. Rabbat,et al.  Communication/Computation Tradeoffs in Consensus-Based Distributed Optimization , 2012, NIPS.

[4]  John C. Duchi,et al.  Distributed delayed stochastic optimization , 2011, 2012 IEEE 51st IEEE Conference on Decision and Control (CDC).

[5]  Asuman E. Ozdaglar,et al.  Distributed Subgradient Methods for Multi-Agent Optimization , 2009, IEEE Transactions on Automatic Control.

[6]  Stephen J. Wright,et al.  An asynchronous parallel stochastic coordinate descent algorithm , 2013, J. Mach. Learn. Res..

[7]  Qing Ling,et al.  EXTRA: An Exact First-Order Algorithm for Decentralized Consensus Optimization , 2014, 1404.6264.

[8]  Martin J. Wainwright,et al.  Communication-efficient algorithms for statistical optimization , 2012, 2012 IEEE 51st IEEE Conference on Decision and Control (CDC).

[9]  Ohad Shamir,et al.  Optimal Distributed Online Prediction Using Mini-Batches , 2010, J. Mach. Learn. Res..

[10]  Aurélien Garivier,et al.  On the Complexity of Best-Arm Identification in Multi-Armed Bandit Models , 2014, J. Mach. Learn. Res..

[11]  Angelia Nedic,et al.  Distributed Stochastic Subgradient Projection Algorithms for Convex Optimization , 2008, J. Optim. Theory Appl..

[12]  Joel A. Tropp,et al.  User-Friendly Tail Bounds for Sums of Random Matrices , 2010, Found. Comput. Math..

[13]  Avleen Singh Bijral,et al.  Mini-Batch Primal and Dual Methods for SVMs , 2013, ICML.

[14]  Ohad Shamir,et al.  Communication-Efficient Distributed Optimization using an Approximate Newton-type Method , 2013, ICML.

[15]  Ohad Shamir,et al.  Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization , 2011, ICML.

[16]  Martin J. Wainwright,et al.  Dual Averaging for Distributed Optimization: Convergence Analysis and Network Scaling , 2010, IEEE Transactions on Automatic Control.

[17]  Pascal Bianchi,et al.  Performance of a Distributed Stochastic Approximation Algorithm , 2012, IEEE Transactions on Information Theory.

[18]  Massih-Reza Amini,et al.  Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization , 2009, NIPS.

[19]  Yoram Singer,et al.  Pegasos: primal estimated sub-gradient solver for SVM , 2011, Math. Program..

[20]  Xiaoqiao Meng,et al.  Real-time forest fire detection with wireless sensor networks , 2005, Proceedings. 2005 International Conference on Wireless Communications, Networking and Mobile Computing, 2005..

[21]  Ohad Shamir,et al.  Stochastic Convex Optimization , 2009, COLT.

[22]  On Doubly Stochastic Graph Optimization , 2009 .

[23]  Stephen P. Boyd,et al.  Fastest Mixing Markov Chain on a Graph , 2004, SIAM Rev..

[24]  Denis J. Dean,et al.  Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables , 1999 .

[25]  Michael G. Rabbat,et al.  Distributed strongly convex optimization , 2012, 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton).

[26]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[27]  Aryan Mokhtari,et al.  DSA: Decentralized Double Stochastic Averaging Gradient Algorithm , 2015, J. Mach. Learn. Res..

[28]  Joseph K. Bradley,et al.  Parallel Coordinate Descent for L1-Regularized Loss Minimization , 2011, ICML.

[29]  Ohad Shamir,et al.  Better Mini-Batch Algorithms via Accelerated Gradient Methods , 2011, NIPS.