Distributed Stochastic Optimization via Adaptive Stochastic Gradient Descent

Stochastic convex optimization algorithms are the most popular way to train machine learning models on large-scale data. Scaling up the training process of these models is crucial in many applications, but the most popular algorithm, Stochastic Gradient Descent (SGD), is a serial algorithm that is surprisingly hard to parallelize. In this paper, we propose an efficient distributed stochastic optimization method based on adaptive step sizes and variance reduction techniques. We achieve a linear speedup in the number of machines, a small memory footprint, and only a small number of synchronization rounds -- logarithmic in the dataset size -- in which the computation nodes communicate with each other. Critically, our approach is a general reduction that parallelizes any serial SGD algorithm, allowing us to leverage the significant progress that has been made in designing adaptive SGD algorithms. We conclude by implementing our algorithm in the Spark distributed framework and exhibiting dramatic performance gains on large-scale logistic regression problems.
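
To make the high-level recipe concrete, here is a minimal Python sketch of the general pattern the abstract describes: each worker runs a serial SGD pass on its local data shard between a small number of synchronization rounds, with an SVRG-style variance-reduction correction applied at each round. This is only an illustration under our own assumptions -- the helper names (`logistic_grad`, `local_sgd`, `distributed_sgd`) and the 1/sqrt(t) step size are placeholders, not the paper's actual algorithm or its adaptive step-size rule.

```python
import numpy as np

def logistic_grad(w, X, y):
    # Average gradient of the logistic loss at w over examples (X, y), y in {-1, +1}.
    z = y * (X @ w)
    return -(X.T @ (y / (1.0 + np.exp(z)))) / len(y)

def local_sgd(anchor, X, y, full_grad, steps, rng):
    # One worker's serial SGD pass, started at the shared anchor point,
    # with an SVRG-style variance-reduction correction on every step.
    w = anchor.copy()
    for t in range(steps):
        i = rng.integers(len(y))
        xi, yi = X[i:i + 1], y[i:i + 1]
        g = logistic_grad(w, xi, yi) - logistic_grad(anchor, xi, yi) + full_grad
        eta = 1.0 / np.sqrt(t + 1)  # placeholder step size; an adaptive rule would go here
        w -= eta * g
    return w

def distributed_sgd(shards, dim, rounds, steps_per_round, seed=0):
    # `rounds` synchronization rounds; each computes one full gradient and
    # averages the workers' iterates. The loop over shards is sequential here
    # only to keep the sketch self-contained; on a cluster it would be a map step.
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)
    for _ in range(rounds):
        full_grad = np.mean([logistic_grad(w, X, y) for X, y in shards], axis=0)
        w = np.mean([local_sgd(w, X, y, full_grad, steps_per_round, rng)
                     for X, y in shards], axis=0)
    return w
```

For example, with four synthetic shards `shards = [(np.random.randn(100, 5), np.sign(np.random.randn(100))) for _ in range(4)]`, calling `distributed_sgd(shards, dim=5, rounds=10, steps_per_round=100)` runs ten synchronization rounds of local SGD with averaging.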
