Perturbed Iterate Analysis for Asynchronous Stochastic Optimization

We introduce and analyze stochastic optimization methods in which the input to each gradient update is perturbed by bounded noise. We show that this framework forms the basis of a unified approach to analyzing asynchronous implementations of stochastic optimization algorithms. In this framework, asynchronous stochastic optimization algorithms can be thought of as serial methods operating on noisy inputs. Using our perturbed iterate framework, we provide new analyses of the Hogwild! algorithm and of asynchronous stochastic coordinate descent that are simpler than earlier analyses, remove many assumptions of previous models, and in some cases yield improved upper bounds on convergence rates. We then apply our framework to develop and analyze KroMagnon: a novel parallel, sparse stochastic variance-reduced gradient (SVRG) algorithm. We demonstrate experimentally on a 16-core machine that this sparse and parallel version of SVRG is in some cases more than four orders of magnitude faster than the standard SVRG algorithm.
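
To make the variance-reduction step concrete, below is a minimal serial SVRG sketch in Python/NumPy for a least-squares objective. This is our own illustrative code, not the paper's implementation: the function name, the objective, and all parameters are assumptions, and the sparse updates and lock-free parallelism that distinguish KroMagnon are not modeled here.

```python
import numpy as np

def svrg(A, b, step=0.02, epochs=30, seed=0):
    """Minimal serial SVRG sketch for (1/n) * sum_i 0.5*(A_i x - b_i)^2.

    Illustrative only; KroMagnon's sparse, asynchronous updates
    are not shown.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(epochs):
        # Snapshot point and its full gradient, recomputed once per epoch.
        y = x.copy()
        full_grad = A.T @ (A @ y - b) / n
        for _ in range(n):
            i = rng.integers(n)
            # Variance-reduced stochastic gradient:
            #   g = grad_i(x) - grad_i(y) + full_grad,
            # where grad_i(x) = A_i * (A_i @ x - b_i).
            g = A[i] * (A[i] @ x - b[i]) - A[i] * (A[i] @ y - b[i]) + full_grad
            x -= step * g
    return x

# Tiny usage example on a random, consistent least-squares problem.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.normal(size=(200, 10))
    b = A @ rng.normal(size=10)
    x = svrg(A, b)
    print("residual norm:", np.linalg.norm(A @ x - b))
```

In the perturbed iterate view, an asynchronous implementation of such a loop behaves like this serial method except that each update reads a noisy (inconsistent) version of `x`; the analysis then bounds the effect of that bounded perturbation on convergence.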
