Courier: Real-Time Optimal Batch Size Prediction for Latency SLOs in BigDL

Distributed machine learning has seen an immense rise in popularity in recent years. Many companies and universities use computational clusters to train and run machine learning models. Unfortunately, operating such a cluster imposes large costs, so it is crucial to attain the highest possible system utilization. Moreover, those who offer computational clusters as a service must not only keep utilization high but also meet the Service Level Agreements (SLAs) governing system response time. This becomes increasingly complex in multitenant scenarios, where the time dedicated to each task has to be limited to achieve fairness. In this work, we analyze how different parameters of a machine learning job influence response time and system utilization, and we propose Courier. Courier is a model that, based on the type of machine learning job, selects a batch size such that the response time adheres to the specified Service Level Objectives (SLOs) while also yielding the highest possible accuracy. We gather data by conducting real-world experiments on a BigDL cluster. We then study the influence of these factors and build several predictive models, which lead us to the proposed Courier model.
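To make the idea concrete, the sketch below illustrates one way such a batch-size selector could work: a latency model is fit offline on measurements of job features versus response time, and at submission time the largest candidate batch size whose predicted response time stays within the SLO is chosen. This is a minimal illustration, not the authors' implementation; the feature set (batch size, model size, number of workers), the candidate batch sizes, the training data, and the choice of a gradient-boosted regressor are all assumptions made for the example.

```python
# Minimal sketch (not Courier itself): choose the largest batch size whose
# predicted response time still satisfies the latency SLO, assuming a latency
# model trained offline on measurements from a BigDL cluster.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical measurements: [batch_size, model_size_mb, num_workers] -> seconds.
X_train = np.array([
    [64, 100, 4], [128, 100, 4], [256, 100, 4],
    [64, 500, 8], [128, 500, 8], [256, 500, 8],
])
y_train = np.array([0.8, 1.3, 2.1, 1.5, 2.4, 3.9])

latency_model = GradientBoostingRegressor().fit(X_train, y_train)

def select_batch_size(model_size_mb, num_workers, slo_seconds,
                      candidates=(32, 64, 128, 256, 512)):
    """Return the largest candidate batch size predicted to meet the SLO,
    falling back to the smallest candidate if none is predicted feasible."""
    feasible = [
        b for b in candidates
        if latency_model.predict([[b, model_size_mb, num_workers]])[0] <= slo_seconds
    ]
    return max(feasible) if feasible else min(candidates)

print(select_batch_size(model_size_mb=100, num_workers=4, slo_seconds=2.0))
```

Picking the largest feasible batch size here is only a stand-in for the paper's accuracy objective; the actual Courier model learns its selection criterion from the experimental data rather than using this fixed rule.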
