Litz: Elastic Framework for High-Performance Distributed Machine Learning

Machine Learning (ML) is an increasingly popular application in clouds and data centers, inspiring new algorithmic and systems techniques that leverage unique properties of ML applications to improve their distributed performance by orders of magnitude. However, applications built using these techniques tend to be static, unable to elastically adapt to the changing resource availability that is characteristic of multi-tenant environments. Existing distributed frameworks are either inelastic or offer programming models that are incompatible with the techniques employed by high-performance ML applications. Motivated by these trends, we present Litz, an elastic framework supporting distributed ML applications. We categorize the wide variety of techniques employed by these applications into three general themes -- stateful workers, model scheduling, and relaxed consistency -- which are collectively supported by Litz's programming model. Our implementation of Litz's execution system transparently enables elasticity and low-overhead execution. We implement several popular ML applications on Litz and show that they can scale in and out quickly to adapt to changing resource availability, and that a scheduler can leverage this elasticity for faster job completion and more efficient resource allocation. Lastly, we show that Litz enables elasticity without compromising performance, achieving performance competitive with state-of-the-art non-elastic ML frameworks.
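The abstract groups these techniques into three themes: stateful workers, model scheduling, and relaxed consistency. The minimal Python sketch below (hypothetical class and method names, not Litz's actual API) illustrates how the themes might fit together: executors keep their data shard locally across tasks, a driver decides which parameter block each executor updates in a pass, and reads from a toy parameter server may lag the latest value by a few updates.

```python
# Hypothetical sketch, not Litz's actual API: illustrates stateful workers,
# model scheduling, and relaxed (bounded-staleness) consistency on a toy
# 1-D least-squares model y = w * x.
import random

random.seed(0)

class ToyParameterServer:
    """Serves parameter reads that may lag the latest value by a few updates."""
    def __init__(self, num_params, staleness=2):
        self.params = [0.0] * num_params
        self.staleness = staleness
        self.history = []                      # snapshots of past versions

    def read(self, idx):
        # With some probability, return a value up to `staleness` versions old.
        if self.history and random.random() < 0.5:
            snapshot = random.choice(self.history[-self.staleness:])
            return snapshot[idx]
        return self.params[idx]

    def update(self, idx, delta):
        self.history.append(list(self.params))
        self.params[idx] += delta

class StatefulExecutor:
    """Keeps its shard of training data across tasks (stateful worker)."""
    def __init__(self, shard):
        self.shard = shard

    def run_task(self, ps, param_idx, step=0.01):
        # One gradient step on this shard for the scheduled parameter block.
        w = ps.read(param_idx)
        grad = sum(2 * (w * x - y) * x for x, y in self.shard) / len(self.shard)
        ps.update(param_idx, -step * grad)

if __name__ == "__main__":
    # Two data shards for y = 3 * x, each held locally by its executor.
    shard_a = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
    shard_b = [(x, 3.0 * x) for x in (1.5, 2.5, 3.5, 4.5)]
    executors = [StatefulExecutor(shard_a), StatefulExecutor(shard_b)]
    ps = ToyParameterServer(num_params=1)
    for _ in range(100):
        for ex in executors:                   # driver schedules each task
            ex.run_task(ps, param_idx=0)
        # Elasticity hook: executors could be added or removed between passes
        # by reassigning shards, since all task state lives in `executors`.
    print("learned w ~", round(ps.params[0], 2))
```

In this toy, stale reads slow convergence slightly but do not prevent it, which is the intuition behind the relaxed-consistency theme; the shard-per-executor state is what an elastic runtime would migrate when scaling in or out.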
