Distributed Machine Learning with a Serverless Architecture

The need to scale up machine learning, in the presence of rapid growth of data in both volume and variety, has sparked broad interest in developing distributed machine learning systems, typically based on parameter servers. However, since these systems are built on a dedicated cluster of physical or virtual machines, they impose non-trivial cluster management overhead on machine learning practitioners and data scientists. In addition, there exists an inherent mismatch between the dynamically varying resource demands of a model training job and the inflexible resource provisioning model of current cluster-based systems. In this paper, we propose SIREN, an asynchronous distributed machine learning framework based on the emerging serverless architecture, in which stateless functions can be executed in the cloud without the complexity of building and maintaining virtual machine infrastructure. With SIREN, we achieve a higher level of parallelism and elasticity by using a swarm of stateless functions, each working on a different batch of data, while greatly reducing system configuration overhead. Furthermore, we propose a scheduler based on Deep Reinforcement Learning to dynamically control the number and memory size of the stateless functions used in each training epoch. The scheduler learns from the training process itself, in pursuit of the minimum possible training time under a given cost budget. Extensive experiments with our real-world prototype implementation on AWS Lambda show that SIREN reduces model training time by up to 44% compared with traditional machine learning training benchmarks on AWS EC2 at the same cost.
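
To make the execution model concrete, the sketch below illustrates in Python how a driver might fan out one training epoch to a swarm of stateless AWS Lambda functions, each computing a gradient over its own batch of data, and then aggregate the results. This is a minimal sketch under stated assumptions, not SIREN's actual implementation: the function name `siren-worker`, the payload fields, and the synchronous aggregation step are illustrative placeholders.

```python
# Sketch of a serverless training epoch: a driver fans out gradient
# computation to N stateless AWS Lambda workers, each handling one batch.
# Assumptions (not from the paper): a deployed function named
# "siren-worker" that accepts {"epoch", "batch_id"} and returns a
# JSON body containing a gradient vector under the key "gradient".
import json
from concurrent.futures import ThreadPoolExecutor

import boto3

lambda_client = boto3.client("lambda")


def invoke_worker(epoch, batch_id):
    """Invoke one stateless worker on one data batch and return its gradient."""
    response = lambda_client.invoke(
        FunctionName="siren-worker",        # hypothetical worker function
        InvocationType="RequestResponse",   # block until the gradient is returned
        Payload=json.dumps({"epoch": epoch, "batch_id": batch_id}),
    )
    return json.loads(response["Payload"].read())["gradient"]


def run_epoch(epoch, num_workers):
    """Fan out one epoch to `num_workers` functions and average their gradients.

    In SIREN, `num_workers` (and each function's memory size) would be chosen
    per epoch by the scheduler; the averaging here is a simple synchronous
    stand-in for the framework's asynchronous update path.
    """
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(invoke_worker, epoch, b) for b in range(num_workers)]
        gradients = [f.result() for f in futures]
    dim = len(gradients[0])
    return [sum(g[i] for g in gradients) / len(gradients) for i in range(dim)]
```

The key design point the sketch highlights is that the degree of parallelism is just an invocation count: because the workers are stateless functions rather than provisioned machines, a scheduler can change the number and memory size of workers from one epoch to the next without any cluster reconfiguration.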
