Online Job Scheduling in Distributed Machine Learning Clusters

Large-scale distributed machine learning systems are now widely deployed to support analytics and intelligence services in IT firms. To derive a prediction/inference model, e.g., a deep neural network, from a large dataset, multiple workers run in parallel, each training on a partition of the input data and updating shared model parameters. In a shared cluster handling multiple training jobs, a fundamental issue is how to efficiently schedule jobs and set the number of concurrent workers for each job, such that server resources are maximally utilized and model training completes in time. Targeting a distributed machine learning system using the parameter server framework, we design an online algorithm for scheduling arriving jobs and adjusting the numbers of concurrent workers and parameter servers for each job over its course, to maximize the overall utility of all jobs, which depends on their completion times. Our online algorithm design uses a primal-dual framework coupled with efficient dual subroutines, achieving good long-term performance guarantees with polynomial time complexity. The practical effectiveness of the online algorithm is evaluated through trace-driven simulation and testbed experiments, which show that it outperforms scheduling algorithms commonly adopted in today's cloud systems.
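
To make the primal-dual idea concrete, below is a minimal Python sketch of an online scheduler in the spirit of the abstract: it keeps a dual price per time slot, and for each arriving job runs a simple dual subroutine that enumerates worker counts and admits the job only when utility exceeds the priced resource cost. All names, the fixed one-parameter-server-per-job assumption, the enumeration-based subroutine, and the multiplicative price-update rule are illustrative assumptions, not the authors' exact algorithm.

```python
# Hedged sketch of an online primal-dual scheduler for ML training jobs.
# Illustrative only: the actual paper's algorithm and guarantees differ in detail.

import math
from dataclasses import dataclass


@dataclass
class Job:
    arrival: int            # time slot at which the job arrives
    total_work: float       # work units (e.g., epochs) needed to finish training
    worker_demand: float    # resource units one worker consumes per slot
    ps_demand: float        # resource units one parameter server consumes per slot
    utility: callable       # maps completion slot -> job utility (non-increasing)


class PrimalDualScheduler:
    """Maintains one dual price per time slot; a job is admitted only if some
    schedule yields positive utility minus priced resource cost."""

    def __init__(self, horizon, capacity, price_cap):
        self.capacity = capacity
        self.price = [1e-3] * horizon       # dual variables, start near zero
        self.used = [0.0] * horizon
        self.price_cap = price_cap          # upper bound used in the price update

    def dual_subroutine(self, job, max_workers):
        """Enumerate candidate worker counts: more workers finish earlier but
        cost more at current prices. Return (payoff, schedule) of the best one."""
        best = (0.0, None)
        for w in range(1, max_workers + 1):
            slots_needed = math.ceil(job.total_work / w)
            finish = job.arrival + slots_needed
            if finish > len(self.price):
                continue
            demand = w * job.worker_demand + job.ps_demand  # assume 1 PS per job
            cost = sum(self.price[t] * demand for t in range(job.arrival, finish))
            payoff = job.utility(finish) - cost
            feasible = all(self.used[t] + demand <= self.capacity
                           for t in range(job.arrival, finish))
            if payoff > best[0] and feasible:
                best = (payoff, (w, job.arrival, finish, demand))
        return best

    def on_arrival(self, job, max_workers=8):
        payoff, schedule = self.dual_subroutine(job, max_workers)
        if schedule is None:
            return None                     # reject: no profitable schedule exists
        w, start, finish, demand = schedule
        for t in range(start, finish):
            self.used[t] += demand
            # Multiplicative price update: the price of a slot grows with its
            # utilization, so later jobs see scarcer slots as more expensive.
            self.price[t] = 1e-3 * (self.price_cap / 1e-3) ** (self.used[t] / self.capacity)
        return w, finish


if __name__ == "__main__":
    sched = PrimalDualScheduler(horizon=50, capacity=32.0, price_cap=10.0)
    job = Job(arrival=0, total_work=20.0, worker_demand=2.0, ps_demand=1.0,
              utility=lambda finish: max(0.0, 100.0 - 2.0 * finish))
    print(sched.on_arrival(job))            # e.g., (workers chosen, completion slot)
```

The key design choice mirrored here is that admission and resource sizing are decided purely from current dual prices, which is what lets an online primal-dual analysis bound the gap to the offline optimum.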
