GPU-SAM: Leveraging multi-GPU split-and-merge execution for system-wide real-time support

We examine the benefits and costs of split-and-merge execution on multi-GPU systems. Split-and-merge execution can improve schedulability in real-time systems. We develop a schedulability analysis for split-and-merge execution. We propose an algorithm, called GPA, to decide the number of GPUs to be used. We demonstrate through evaluations that GPA can improve system-wide schedulability.

Multi-GPU systems are an attractive platform for speeding up data-parallel GPGPU computation. The idea of split-and-merge execution has been introduced to exploit the parallelism of multiple GPUs even further. However, how to apply this idea properly to real-time multi-GPU systems has not been explored before. This paper presents an open-source real-time multi-GPU scheduling framework, called GPU-SAM, that transparently splits each GPGPU application into smaller computation units and executes them in parallel across multiple GPUs, aiming to satisfy real-time constraints. Multi-GPU split-and-merge execution can reduce overall execution time, but it also affects the schedulability of individual applications in several ways. We therefore analyze the benefit and cost of split-and-merge execution on multiple GPUs and derive a schedulability analysis that captures these seemingly conflicting influences. We also propose a GPU parallelism assignment policy that determines the multi-GPU mode of each application from the perspective of system-wide schedulability. Our experimental results show that GPU-SAM improves schedulability in real-time multi-GPU systems by relaxing the restriction that a kernel be launched on a single GPU only and by choosing better multi-GPU execution modes.
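The general shape of split-and-merge execution can be illustrated with a minimal sketch. This is not GPU-SAM's implementation: worker threads stand in for GPUs, and the names `split_range`, `kernel`, and `split_and_merge` are hypothetical, introduced only for illustration. The sketch splits a data-parallel workload into contiguous sub-ranges, runs the same kernel on each sub-range concurrently, and merges the partial results in order.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(n_items, n_parts):
    """Split [0, n_items) into n_parts contiguous, near-equal (start, end) chunks."""
    base, extra = divmod(n_items, n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        # The first `extra` chunks take one additional item each.
        end = start + base + (1 if i < extra else 0)
        chunks.append((start, end))
        start = end
    return chunks


def kernel(data, start, end):
    """Hypothetical data-parallel kernel: squares each element of its sub-range."""
    return [x * x for x in data[start:end]]


def split_and_merge(data, n_devices):
    """Run `kernel` over `data` on n_devices workers (stand-ins for GPUs)."""
    chunks = split_range(len(data), n_devices)
    with ThreadPoolExecutor(max_workers=n_devices) as pool:
        # pool.map returns results in the order of `chunks`, so merging
        # by concatenation reproduces the single-device output order.
        parts = list(pool.map(lambda c: kernel(data, *c), chunks))
    merged = []
    for part in parts:
        merged.extend(part)
    return merged
```

Calling `split_and_merge(data, 1)` corresponds to ordinary single-device execution, while larger values of `n_devices` correspond to different multi-GPU modes. On real hardware, each additional device would presumably add data-transfer and merge overhead, which is the kind of trade-off the abstract refers to as benefit and cost.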
