Serving DNN Models with Multi-Instance GPUs: A Case of the Reconfigurable Machine Scheduling Problem

Multi-Instance GPU (MIG) is a new feature introduced by NVIDIA A100 GPUs that partitions one physical GPU into multiple GPU instances. With MIG, the A100 can be the most cost-efficient GPU ever for serving Deep Neural Networks (DNNs). However, discovering the most efficient GPU partitions is challenging. The underlying problem is NP-hard; moreover, it is a new abstract problem, which we define as the Reconfigurable Machine Scheduling Problem (RMS). This paper studies serving DNNs with MIG, a new case of RMS. We further propose a solution, MIG-serving. MIG-serving is an algorithm pipeline that blends a variety of newly designed algorithms and customized classic algorithms, including a heuristic greedy algorithm, a Genetic Algorithm (GA), and a Monte Carlo Tree Search (MCTS) algorithm. We implement MIG-serving on Kubernetes. Our experiments show that compared to using A100 GPUs as-is, MIG-serving can save up to 40% of GPUs while providing the same throughput.
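
To make the greedy stage of such a pipeline concrete, below is a minimal sketch of a greedy MIG-partitioning heuristic in the spirit of the abstract's description. It is not the authors' implementation: the service names, throughput numbers, and the `greedy_plan`/`best_profile` helpers are hypothetical, and it simplifies real MIG by treating the A100's 7 compute slices as freely divisible (actual MIG permits only certain profile combinations per GPU).

```python
# Hypothetical sketch of a greedy MIG-partitioning heuristic: assign each DNN
# service its most slice-efficient MIG profile, then pack instances onto GPUs.
# Throughput figures are made up for illustration; a real system would profile
# each model on each MIG instance size.

from dataclasses import dataclass

# A100 MIG instance profiles, keyed by the number of compute slices they
# occupy (an A100 exposes 7 slices in total).
MIG_PROFILES = {1: "1g.5gb", 2: "2g.10gb", 3: "3g.20gb", 4: "4g.20gb", 7: "7g.40gb"}
SLICES_PER_GPU = 7

@dataclass
class Service:
    name: str
    demand_qps: float      # required aggregate throughput
    qps_by_slices: dict    # measured throughput on each MIG profile

def best_profile(svc: Service) -> int:
    """Pick the profile with the highest throughput per slice for this service."""
    return max(svc.qps_by_slices, key=lambda s: svc.qps_by_slices[s] / s)

def greedy_plan(services):
    """Greedily place instances of each service onto GPUs, opening new GPUs
    only when no existing GPU has enough free slices. Simplification: any mix
    of profiles is allowed as long as slice counts fit."""
    placements, free = [], []  # free[i] = remaining slices on GPU i
    for svc in sorted(services, key=lambda s: -s.demand_qps):
        slices = best_profile(svc)
        qps = svc.qps_by_slices[slices]
        remaining = svc.demand_qps
        while remaining > 0:
            gpu = next((i for i, f in enumerate(free) if f >= slices), None)
            if gpu is None:  # no GPU has room: open a new one
                free.append(SLICES_PER_GPU)
                gpu = len(free) - 1
            free[gpu] -= slices
            placements.append((svc.name, MIG_PROFILES[slices], gpu))
            remaining -= qps
    return placements, len(free)

if __name__ == "__main__":
    services = [
        Service("resnet50", 900, {1: 100, 2: 180, 3: 250, 4: 310, 7: 480}),
        Service("bert-base", 400, {1: 40, 2: 85, 3: 120, 4: 150, 7: 240}),
    ]
    plan, gpus = greedy_plan(services)
    for name, profile, gpu in plan:
        print(f"{name} -> {profile} on GPU {gpu}")
    print(f"GPUs used: {gpus}")
```

A greedy plan like this gives a feasible starting point; the abstract's GA and MCTS stages would then search for partitions that reduce the GPU count further.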
