GAI: A Centralized Tree-Based Scheduler for Machine Learning Workload in Large Shared Clusters

With widespread applications in image recognition, language translation, computer vision and other areas, deep learning (DL) have been proliferating over the past decade. Practitioners from different business groups in industries train DL models on a shared cloud computing infrastructure for these applications with different priorities. During the model training process, one of the key challenges is to minimize the lifecycle of high priority model training jobs. This paper analyzes the distributed training of machine learning (ML) models and identifies short board effect in the training process: GPU training requires higher network bandwidth compared to CPU training. The key insight motivates the design of GAI, a centralized scheduler for ML workload. It relies on two techniques: (1) tree-based structure. The structure stores the cluster information hierarchically to apply multi-layer scheduling. (2) well-extended priority algorithm. We consider priorities from multiple dimensions for model training jobs comprehensively to support resource degradation and preemption. The prototype of GAI is implemented on top of Kubernetes, Kubeflow, and TensorFlow. It is evaluated using a simulator and a real cloud-based cluster. Evaluations show 28% increase in scheduling throughput and 21% training convergence speedup on DL models.

[1]  Jie Jiang,et al.  Angel: a new large-scale machine learning system , 2018 .

[2]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Athanasios V. Vasilakos,et al.  Daphne: A Flexible and Hybrid Scheduling Framework in Multi-Tenant Clusters , 2018, IEEE Transactions on Network and Service Management.

[4]  Zheng Zhang,et al.  MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems , 2015, ArXiv.

[5]  Abhishek Verma,et al.  Large-scale cluster management at Google with Borg , 2015, EuroSys.

[6]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[8]  Alexander J. Smola,et al.  Scaling Distributed Machine Learning with the Parameter Server , 2014, OSDI.

[9]  Fang Dong,et al.  BAR: An Efficient Data Locality Driven Task Scheduling Algorithm for Cloud Computing , 2011, 2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing.

[10]  Kejiang Ye,et al.  Imbalance in the cloud: An analysis on Alibaba cluster trace , 2017, 2017 IEEE International Conference on Big Data (Big Data).

[11]  Christina Delimitrou,et al.  Paragon: QoS-aware scheduling for heterogeneous datacenters , 2013, ASPLOS '13.

[12]  Randy H. Katz,et al.  Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center , 2011, NSDI.

[13]  David M. Brooks,et al.  Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective , 2018, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[14]  Christina Delimitrou,et al.  Quasar: resource-efficient and QoS-aware cluster management , 2014, ASPLOS.

[15]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[16]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[17]  Michael Abd-El-Malek,et al.  Omega: flexible, scalable schedulers for large compute clusters , 2013, EuroSys '13.

[18]  Yuan Yu,et al.  TensorFlow: A system for large-scale machine learning , 2016, OSDI.

[19]  Patrick Wendell,et al.  Sparrow: distributed, low latency scheduling , 2013, SOSP.

[20]  Carlo Curino,et al.  Apache Hadoop YARN: yet another resource negotiator , 2013, SoCC.

[21]  Palash Bera,et al.  How colors in business dashboards affect users' decision making , 2016, Commun. ACM.

[22]  Robert N. M. Watson,et al.  Firmament: Fast, Centralized Cluster Scheduling at Scale , 2016, OSDI.

[23]  Ryan Kastner,et al.  A hardware accelerated system for high throughput cellular image analysis , 2018, J. Parallel Distributed Comput..

[24]  Carlo Curino,et al.  Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared Clusters , 2015, USENIX Annual Technical Conference.

[25]  Anne-Marie Kermarrec,et al.  Hawk: Hybrid Datacenter Scheduling , 2015, USENIX Annual Technical Conference.