More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server

We propose a parameter server system for distributed ML, which follows a Stale Synchronous Parallel (SSP) model of computation that maximizes the time computational workers spend doing useful work on ML algorithms, while still providing correctness guarantees. The parameter server provides an easy-to-use shared interface for read/write access to an ML model's values (parameters and variables), and the SSP model allows distributed workers to read older, stale versions of these values from a local cache, instead of waiting to fetch them from central storage. This significantly increases the proportion of time workers spend computing rather than waiting. Furthermore, the SSP model ensures ML algorithm correctness by bounding the maximum age of the stale values. We provide a proof of correctness under SSP, as well as empirical results demonstrating that the SSP model achieves faster algorithm convergence on several different ML problems, compared to fully synchronous and asynchronous schemes.
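To make the read/write behavior concrete, the sketch below illustrates the bounded-staleness read rule under the stated assumptions: a worker serves parameter reads from its local cache as long as the cached copy is at most `staleness` iterations old, and otherwise refreshes from central storage. The class and method names (InMemoryServer, SSPWorkerCache) are illustrative assumptions, not the paper's implementation, and the server-side blocking of workers that run too far ahead is omitted.

```python
class InMemoryServer:
    """Toy stand-in for the central parameter storage (hypothetical)."""
    def __init__(self):
        self.params = {}

    def get(self, key):
        return self.params.get(key, 0.0)

    def add(self, key, delta):
        self.params[key] = self.params.get(key, 0.0) + delta


class SSPWorkerCache:
    """Per-worker cache that answers reads locally while they remain
    within the staleness bound, avoiding a round-trip to the server."""
    def __init__(self, server, staleness):
        self.server = server        # central parameter storage
        self.staleness = staleness  # maximum permitted age, in iterations
        self.cache = {}             # key -> (value, worker clock at fetch time)
        self.clock = 0              # this worker's iteration counter

    def read(self, key):
        # Serve from the local cache if the entry is at most `staleness`
        # iterations old; otherwise refresh from central storage.
        if key in self.cache:
            value, fetched_at = self.cache[key]
            if fetched_at >= self.clock - self.staleness:
                return value
        value = self.server.get(key)
        self.cache[key] = (value, self.clock)
        return value

    def update(self, key, delta):
        # Additive updates go to central storage; other workers observe
        # them once their own staleness bound forces a refresh.
        self.server.add(key, delta)

    def advance_clock(self):
        # End of one iteration. A full SSP implementation would also block
        # here if this worker were more than `staleness` iterations ahead
        # of the slowest worker; that coordination is omitted in this sketch.
        self.clock += 1


# Usage: with staleness 2, subsequent reads reuse the cached value
# until the worker's clock moves past the staleness bound.
server = InMemoryServer()
worker = SSPWorkerCache(server, staleness=2)
worker.update("w", 0.5)
print(worker.read("w"))   # fetches from the server, caches at clock 0
worker.advance_clock()
print(worker.read("w"))   # within the staleness bound, served from cache
```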
