Troodon: A machine-learning based load-balancing application scheduler for CPU-GPU system

Abstract Heterogeneous computing machines consisting of a CPU and one or more GPUs are increasingly used today because of their higher performance-to-cost ratio and lower energy consumption. OpenCL has become an industry standard for programming such heterogeneous systems owing to its portability across computing architectures, and application developers are porting their cluster and Cloud applications to OpenCL to exploit the computing capabilities of these systems. With the growing number of such applications, access to shared accelerating devices (such as CPUs and GPUs) should be managed by an efficient load-balancing scheduling heuristic capable of reducing execution time and increasing throughput while keeping device utilization high. Most OpenCL applications are suited to (execute faster on) a specific computing device (CPU or GPU), and the speedup an application obtains on its suitable device varies with the data size. Mapping applications to computing devices without considering device suitability and the obtainable speedup on the suitable device leads to sub-optimal execution time, lower throughput, and load imbalance. An application scheduler should therefore consider both device suitability and speedup variation in its scheduling decisions to reduce execution time and increase throughput. In this paper, we present Troodon, a novel load-balancing scheduling heuristic that uses a machine-learning-based device-suitability model to classify OpenCL applications as either CPU-suitable or GPU-suitable. Troodon also includes a speedup predictor that estimates the speedup a job will obtain when executed on its suitable device, and it incorporates the E-OSched scheduling mechanism to map jobs onto the CPU and GPUs in a load-balanced way. This reduces application execution time, increases system throughput, and improves device utilization. We evaluate the proposed scheduler on a large number of data-parallel applications and compare it with several other state-of-the-art scheduling heuristics. The experimental evaluation demonstrates that the proposed scheduler outperforms the existing heuristics, reducing application execution time by up to 38% while increasing system throughput and device utilization.
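To make the scheduling idea concrete, the sketch below illustrates one possible Troodon-style scheduling pass in Python: classify each job's suitable device, predict its speedup, handle high-speedup jobs first so they stay on their suitable device, and let the remaining jobs absorb the load-balancing slack (an E-OSched-like step). This is a minimal, hypothetical sketch, not the authors' implementation; the `Job` fields, the `est_cost` load estimate, the 1.5x offload cutoff, and the stub models in the demo are all assumptions made for illustration.

```python
# Hypothetical sketch of a Troodon-style scheduling pass (not the authors' code).
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Job:
    name: str
    features: Sequence[float]   # static kernel features + data size (assumed)
    est_cost: float = 1.0       # rough runtime estimate used for load balancing


def schedule(
    jobs: List[Job],
    classify: Callable[[Sequence[float]], str],           # -> "CPU" or "GPU"
    predict_speedup: Callable[[Sequence[float]], float],  # speedup on suitable device
) -> Tuple[List[Job], List[Job]]:
    """Return (cpu_queue, gpu_queue): suitability first, speedup-ordered,
    with a simple load-balancing offload step."""
    annotated = [
        (job, classify(job.features), predict_speedup(job.features)) for job in jobs
    ]
    # Jobs with the highest predicted speedup are placed first so they keep
    # their suitable device; later jobs take up the load-balancing slack.
    annotated.sort(key=lambda t: t[2], reverse=True)

    cpu_queue: List[Job] = []
    gpu_queue: List[Job] = []
    cpu_load = gpu_load = 0.0
    for job, device, speedup in annotated:
        prefer_gpu = device == "GPU"
        preferred_load = gpu_load if prefer_gpu else cpu_load
        other_load = cpu_load if prefer_gpu else gpu_load
        # Offload only when the preferred device is clearly busier and the job
        # loses little by migrating (low predicted speedup); 1.5x is assumed.
        if preferred_load > other_load + job.est_cost and speedup < 1.5:
            prefer_gpu = not prefer_gpu
        if prefer_gpu:
            gpu_queue.append(job)
            gpu_load += job.est_cost
        else:
            cpu_queue.append(job)
            cpu_load += job.est_cost
    return cpu_queue, gpu_queue


if __name__ == "__main__":
    # Stubs standing in for the trained suitability classifier and speedup
    # predictor (in practice these could be scikit-learn estimators).
    toy_classify = lambda f: "GPU" if f[0] > 0.5 else "CPU"
    toy_speedup = lambda f: 1.0 + 4.0 * f[0]
    demo = [Job(f"k{i}", [i / 10.0]) for i in range(10)]
    cpu_q, gpu_q = schedule(demo, toy_classify, toy_speedup)
    print("CPU:", [j.name for j in cpu_q], "GPU:", [j.name for j in gpu_q])
```

In this sketch the two queues would then be drained by separate OpenCL command queues on the CPU and GPU devices; how the real scheduler enqueues and synchronizes kernels is not shown here.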
