Chimera: Collaborative Preemption for Multitasking on a Shared GPU

The demand for multitasking on graphics processing units (GPUs) is constantly increasing as they have become a default component of modern computer systems alongside traditional processors (CPUs). Preemptive multitasking on CPUs has been primarily supported through context switching. However, the same preemption strategy incurs substantial overhead on GPUs due to their large context. The overhead comes in two dimensions: a preempting kernel suffers a long preemption latency, and system throughput is wasted during the switch. Without precise control over this large preemption overhead, multitasking on GPUs has little use for applications with strict latency requirements. In this paper, we propose Chimera, a collaborative preemption approach that can precisely control the overhead of multitasking on GPUs. Chimera first introduces streaming multiprocessor (SM) flushing, which can instantly preempt an SM by detecting and exploiting idempotent execution. Chimera uses flushing collaboratively with two previously proposed preemption techniques for GPUs, namely context switching and draining, to minimize throughput overhead while achieving a required preemption latency. Evaluations show that Chimera violates the deadline for only 0.2% of preemption requests under a 15 us preemption latency constraint. For multi-programmed workloads, Chimera improves average normalized turnaround time by 5.5x and system throughput by 12.2%.
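The collaborative selection described above can be illustrated with a small sketch: per SM, pick the preemption technique with the lowest estimated throughput overhead among those whose estimated latency fits the deadline, and allow flushing only when the kernel's execution is idempotent. This is a minimal illustration under assumed cost models, not the paper's actual algorithm; all names, fields, and constants below are hypothetical.

```python
# Hedged sketch of collaborative preemption-technique selection.
# Assumption: per-SM latency/overhead estimates are available; the
# exact estimators in Chimera are not reproduced here.
from dataclasses import dataclass

@dataclass
class SMState:
    switch_latency_us: float   # time to save/restore the full SM context
    drain_latency_us: float    # time for in-flight thread blocks to finish
    switch_overhead: float     # throughput lost while context switching
    drain_overhead: float      # throughput lost while draining (often ~0)
    flush_overhead: float      # useful work discarded and re-executed by a flush
    idempotent: bool           # flushing is only safe for idempotent execution

FLUSH_LATENCY_US = 1.0         # flushing stops an SM almost instantly (assumed)

def choose_technique(sm: SMState, deadline_us: float) -> str:
    """Return the technique with minimal throughput overhead that meets
    the latency deadline; fall back to the fastest one otherwise."""
    candidates = [
        ("drain", sm.drain_latency_us, sm.drain_overhead),
        ("switch", sm.switch_latency_us, sm.switch_overhead),
    ]
    if sm.idempotent:
        candidates.append(("flush", FLUSH_LATENCY_US, sm.flush_overhead))

    feasible = [c for c in candidates if c[1] <= deadline_us]
    if not feasible:
        # No technique meets the deadline: minimize latency instead.
        return min(candidates, key=lambda c: c[1])[0]
    return min(feasible, key=lambda c: c[2])[0]
```

For example, with a 15 us deadline, an SM whose drain would take too long but whose kernel is idempotent would be flushed, while an SM whose thread blocks are about to finish would simply be drained at near-zero throughput cost.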
