FusionCL: a machine-learning-based approach for OpenCL kernel fusion to increase system performance

Employing general-purpose graphics processing units (GPGPUs) via OpenCL has greatly reduced the execution time of data-parallel applications by exploiting the massive parallelism available. However, when an application with a small data size executes on a GPU, resources are wasted because the application cannot fully occupy the GPU's compute cores. Owing to the lack of operating-system support on GPUs, there is no mechanism to share a GPU between two kernels. In this paper, we propose a GPU-sharing mechanism between two kernels that increases GPU occupancy and, as a result, reduces the execution time of a job pool. However, if a pair of kernels competes for the same set of resources (i.e., both are compute-intensive or both are memory-intensive), kernel fusion may significantly increase the execution time of the fused kernels. It is therefore pertinent to select a pair of kernels for fusion that yields a significant speedup over their serial execution. This paper presents FusionCL, a machine-learning-based GPU-sharing mechanism for pairs of OpenCL kernels. FusionCL first identifies the kernel pairs in the job pool that are suitable candidates for fusion, using a machine-learning-based fusion suitability classifier. From these candidates, it then selects the pair predicted to produce the maximum speedup after fusion over serial execution, using a fusion speedup predictor. The experimental evaluation shows that the proposed kernel fusion mechanism reduces execution time by 2.83× compared to a baseline scheduling scheme, and by up to 8% compared to the state of the art.
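The two-stage selection described above (a suitability classifier filtering kernel pairs, followed by a speedup predictor choosing the best pair) can be sketched as follows. This is a minimal illustration, not FusionCL's implementation: the feature names (`compute_bound`, `compute_intensity`), the toy suitability rule, and the heuristic speedup estimate are all hypothetical stand-ins for the paper's trained models.

```python
from itertools import combinations

def is_fusion_suitable(a, b):
    # Stand-in for the fusion suitability classifier: pairs competing for the
    # same resource type (both compute-bound or both memory-bound) are rejected.
    return a["compute_bound"] != b["compute_bound"]

def predicted_speedup(a, b):
    # Stand-in for the fusion speedup predictor: a toy heuristic that rewards
    # complementary resource usage. A real predictor would be a trained model.
    return 1.0 + abs(a["compute_intensity"] - b["compute_intensity"])

def select_fusion_pair(job_pool):
    """Return (speedup, name_a, name_b) for the pair with the highest
    predicted fusion speedup among classifier-approved candidates,
    or None if no pair is suitable."""
    candidates = [
        (predicted_speedup(a, b), a["name"], b["name"])
        for a, b in combinations(job_pool, 2)
        if is_fusion_suitable(a, b)
    ]
    return max(candidates) if candidates else None

# Hypothetical job pool with per-kernel features.
jobs = [
    {"name": "matmul",  "compute_bound": True,  "compute_intensity": 0.9},
    {"name": "stencil", "compute_bound": False, "compute_intensity": 0.3},
    {"name": "fft",     "compute_bound": True,  "compute_intensity": 0.8},
]
best = select_fusion_pair(jobs)
```

Under these toy features, `matmul` and `fft` are never paired (both compute-bound), and the complementary pair `matmul`/`stencil` wins with the highest predicted speedup.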
