Interference from GPU System Service Requests

Heterogeneous systems combine general-purpose CPUs with domain-specific accelerators such as GPUs. Recent heterogeneous system designs have enabled GPUs to request OS services, but the domain-specific nature of accelerators means that they must rely on the CPUs to handle these requests. Such system service requests can unintentionally harm the performance of unrelated CPU applications. Tests on a real heterogeneous processor demonstrate that GPU system service requests can degrade contemporaneous CPU application performance by up to 44% and can reduce energy efficiency by limiting CPU sleep time. The reliance on busy CPU cores to perform the system services can also slow down GPU work by up to 18%. This new form of interference is found only in accelerator-rich heterogeneous designs and may be exacerbated in future systems with more accelerators. We explore mitigation strategies from other fields that, in the face of such interference, can increase CPU performance by over 20%, GPU performance by $2\times$, and CPU sleep time by $4.8\times$. However, these strategies do not always help and offer no performance guarantees. We therefore describe a technique to guarantee quality of service to CPU workloads by dynamically adding backpressure to GPU requests.
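The abstract does not say how the backpressure mechanism works, so the following is only a minimal illustrative sketch of one way such a controller could be structured, not the authors' design. It assumes a periodic governor that measures the fraction of CPU time spent servicing GPU requests in a window and, when that fraction exceeds a budget, increases the delay applied before further GPU requests are serviced. The class name `BackpressureGovernor`, its parameters, and the doubling/halving policy are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BackpressureGovernor:
    """Hypothetical governor that throttles GPU system service requests.

    All names and default values here are illustrative assumptions.
    """
    budget: float                 # allowed fraction of CPU time for GPU service work
    step_us: float = 10.0         # smallest non-zero delay, in microseconds
    max_delay_us: float = 1000.0  # upper bound on the injected delay
    delay_us: float = 0.0         # delay currently applied to each GPU request

    def update(self, service_time_us: float, window_us: float) -> float:
        """Adjust the delay from one measurement window and return it."""
        overhead = service_time_us / window_us
        if overhead > self.budget:
            # Over budget: start at step_us, then double, up to the cap.
            self.delay_us = min(self.max_delay_us,
                                max(self.step_us, self.delay_us * 2))
        elif self.delay_us > self.step_us:
            # Under budget: relax the backpressure by halving the delay.
            self.delay_us = self.delay_us / 2
        else:
            # At or below the smallest step: remove the delay entirely.
            self.delay_us = 0.0
        return self.delay_us


if __name__ == "__main__":
    gov = BackpressureGovernor(budget=0.10)
    # Three windows over budget, then three windows under budget.
    for busy_us in (300, 300, 300, 50, 50, 50):
        print(gov.update(service_time_us=busy_us, window_us=1000))
```

In this sketch the delay grows quickly while CPU overhead stays above the budget and decays once it falls back below, which bounds the CPU time lost to GPU requests at the cost of slower GPU service.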
