Opportunistic Computing in GPU Architectures

Data transfer overhead between computing cores and memory hierarchy has been a persistent issue for von Neumann architectures and the problem has only become more challenging with the emergence of manycore systems. A conceptually powerful approach to mitigate this overhead is to bring the computation closer to data, known as Near Data Computing (NDC). Recently, NDC has been investigated in different flavors for CPU-based multicores, while the GPU domain has received little attention. In this paper, we present a novel NDC solution for GPU architectures with the objective of minimizing on-chip data transfer between the computing cores and Last-Level Cache (LLC). To achieve this, we first identify frequently occurring Load-Compute- Store instruction chains in GPU applications. These chains, when offloaded to a compute unit closer to where the data resides, can significantly reduce data movement. We develop two offloading techniques, called LLC-Compute and Omni- Compute. The first technique, LLC-Compute, augments the LLCs with computational hardware for handling the computation offloaded to them. The second technique (Omni- Compute) employs simple bookkeeping hardware to enable GPU cores to compute instructions offloaded by other GPU cores. Our experimental evaluations on nine GPGPU workloads indicate that the LLC-Compute technique provides, on an average, 19% performance improvement (IPC), 11% per- formance/watt improvement, and 29% reduction in on-chip data movement compared to the baseline GPU design. The Omni-Compute design boosts these benefits to 31%, 16% and 44%, respectively.

[1]  Indrani Paul,et al.  Achieving Exascale Capabilities through Heterogeneous Computing , 2015, IEEE Micro.

[2]  Mahmut T. Kandemir,et al.  Race-To-Sleep + Content Caching + Display Caching: A Recipe for Energy-efficient Video Streaming on Handhelds , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[3]  Wu-chun Feng,et al.  MultiCL: Enabling automatic scheduling for task-parallel workloads in OpenCL , 2016, Parallel Comput..

[4]  Alexander Sprintson,et al.  GCA: Global congestion awareness for load balance in Networks-on-Chip , 2013, 2013 Seventh IEEE/ACM International Symposium on Networks-on-Chip (NoCS).

[5]  Onur Mutlu,et al.  Accelerating Dependent Cache Misses with an Enhanced Memory Controller , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[6]  Harold S. Stone,et al.  A Logic-in-Memory Computer , 1970, IEEE Transactions on Computers.

[7]  Nigel P. Topham,et al.  Characterizing memory bottlenecks in GPGPU workloads , 2016, 2016 IEEE International Symposium on Workload Characterization (IISWC).

[8]  Tor M. Aamodt,et al.  Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[9]  Mahmut T. Kandemir,et al.  μC-States: Fine-grained GPU datapath power management , 2016, 2016 International Conference on Parallel Architecture and Compilation Techniques (PACT).

[10]  John Kim,et al.  Throughput-Effective On-Chip Networks for Manycore Accelerators , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[11]  Gabriel H. Loh,et al.  3D-Stacked Memory Architectures for Multi-core Processors , 2008, 2008 International Symposium on Computer Architecture.

[12]  Erik Brunvand,et al.  Impulse: building a smarter memory controller , 1999, Proceedings Fifth International Symposium on High-Performance Computer Architecture.

[13]  Kiyoung Choi,et al.  PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[14]  Onur Mutlu,et al.  Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[15]  Mahmut T. Kandemir,et al.  Data Movement Aware Computation Partitioning , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[16]  Lifan Xu,et al.  Auto-tuning a high-level language targeted to GPU codes , 2012, 2012 Innovative Parallel Computing (InPar).

[17]  A. Gottleib,et al.  The nyu ultracomputer- designing a mimd shared memory parallel computer , 1983 .

[18]  Ramyad Hadidi,et al.  CAIRO , 2017, ACM Trans. Archit. Code Optim..

[19]  Henry Wong,et al.  Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[20]  Collin McCurdy,et al.  The Scalable Heterogeneous Computing (SHOC) benchmark suite , 2010, GPGPU-3.

[21]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[22]  Mahmut T. Kandemir,et al.  Scheduling techniques for GPU architectures with processing-in-memory capabilities , 2016, 2016 International Conference on Parallel Architecture and Compilation Techniques (PACT).

[23]  Jung Ho Ahn,et al.  DRAMA: An Architecture for Accelerated Processing Near Memory , 2015, IEEE Computer Architecture Letters.

[24]  Kevin Skadron,et al.  Dynamic warp subdivision for integrated branch and memory divergence tolerance , 2010, ISCA.

[25]  Mahmut T. Kandemir,et al.  Managing GPU Concurrency in Heterogeneous Architectures , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[26]  Peter M. Kogge,et al.  EXECUBE-A New Architecture for Scaleable MPPs , 1994, 1994 International Conference on Parallel Processing Vol. 1.

[27]  Chita R. Das,et al.  A case for heterogeneous on-chip interconnects for CMPs , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[28]  Mahmut T. Kandemir,et al.  Anatomy of GPU Memory System for Multi-Application Execution , 2015, MEMSYS.

[29]  Mahmut T. Kandemir,et al.  Characterizing diverse handheld apps for customized hardware acceleration , 2017, 2017 IEEE International Symposium on Workload Characterization (IISWC).

[30]  Mahmut T. Kandemir,et al.  Addressing End-to-End Memory Access Latency in NoC-Based Multicores , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[31]  Nan Jiang,et al.  Packet chaining: Efficient single-cycle allocation for on-chip networks , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[32]  Nam Sung Kim,et al.  GPUWattch: enabling energy optimizations in GPGPUs , 2013, ISCA.

[33]  Chita R. Das,et al.  OSCAR: Orchestrating STT-RAM cache traffic for heterogeneous CPU-GPU architectures , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[34]  Chun Chen,et al.  The architecture of the DIVA processing-in-memory chip , 2002, ICS '02.

[35]  George Kesidis,et al.  Spock: Exploiting Serverless Functions for SLO and Cost Aware Resource Procurement in Public Cloud , 2019, 2019 IEEE 12th International Conference on Cloud Computing (CLOUD).

[36]  Tor M. Aamodt,et al.  Thread block compaction for efficient SIMT control flow , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[37]  Mahmut T. Kandemir,et al.  Understanding Energy Efficiency in IoT App Executions , 2019, 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS).

[38]  Jinchun Kim,et al.  Bandwidth-efficient on-chip interconnect designs for GPGPUs , 2015, 2015 52nd ACM/EDAC/IEEE Design Automation Conference (DAC).

[39]  Xipeng Shen,et al.  On-the-fly elimination of dynamic irregularities for GPU computing , 2011, ASPLOS XVI.

[40]  Mahmut T. Kandemir,et al.  Exploiting Core Criticality for Enhanced GPU Performance , 2016, SIGMETRICS.

[41]  Ralph Grishman,et al.  The NYU Ultracomputer—Designing an MIMD Shared Memory Parallel Computer , 1983, IEEE Transactions on Computers.

[42]  Rachata Ausavarungnirun,et al.  Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks , 2018, ASPLOS.

[43]  Maya Gokhale,et al.  Processing in Memory: The Terasys Massively Parallel PIM Array , 1995, Computer.

[44]  Kyung Hoon Kim,et al.  Packet coalescing exploiting data redundancy in GPGPU architectures , 2017, ICS.

[45]  Mahmut T. Kandemir,et al.  Quantifying Data Locality in Dynamic Parallelism in GPUs , 2018, Proc. ACM Meas. Anal. Comput. Syst..

[46]  Mike Ignatowski,et al.  TOP-PIM: throughput-oriented programmable processing in memory , 2014, HPDC '14.

[47]  David R. Kaeli,et al.  Asymmetric NoC Architectures for GPU Systems , 2015, NOCS.

[48]  Bill Lin,et al.  Destination-based adaptive routing on 2D mesh networks , 2010, 2010 ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS).

[49]  G.J. Minden,et al.  A survey of active network research , 1997, IEEE Communications Magazine.

[50]  Aaftab Munshi,et al.  The OpenCL specification , 2009, 2009 IEEE Hot Chips 21 Symposium (HCS).

[51]  Ramyad Hadidi,et al.  GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[52]  William J. Dally,et al.  Route packets, not wires: on-chip inteconnection networks , 2001, DAC '01.

[53]  Christoforos E. Kozyrakis,et al.  A case for intelligent RAM , 1997, IEEE Micro.

[54]  Chau-Wen Tseng,et al.  Compiler optimizations for improving data locality , 1994, ASPLOS VI.

[55]  Wu-chun Feng,et al.  Measuring and modeling on-chip interconnect power on real hardware , 2016, 2016 IEEE International Symposium on Workload Characterization (IISWC).

[56]  Mike O'Connor,et al.  Divergence-Aware Warp Scheduling , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[57]  Sudhakar Yalamanchili,et al.  Understanding Energy Aspects of Processing-near-Memory for HPC Workloads , 2015, MEMSYS.

[58]  Wu-chun Feng,et al.  Automatic Command Queue Scheduling for Task-Parallel Workloads in OpenCL , 2015, 2015 IEEE International Conference on Cluster Computing.

[59]  Mahmut T. Kandemir,et al.  CritICs Critiquing Criticality in Mobile Apps , 2018, 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[60]  Norman P. Jouppi,et al.  CACTI 6.0: A Tool to Model Large Caches , 2009 .

[61]  Nigel P. Topham,et al.  Evaluating and mitigating bandwidth bottlenecks across the memory hierarchy in GPUs , 2017, 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[62]  Gabriel H. Loh Nuwan Jayasena Mark H. Oskin Mark Nutter Da Ignatowski A Processing-in-Memory Taxonomy and a Case for Studying Fixed-function PIM , 2013 .

[63]  Mahmut T. Kandemir,et al.  Controlled Kernel Launch for Dynamic Parallelism in GPUs , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[64]  Chen Sun,et al.  DSENT - A Tool Connecting Emerging Photonics with Electronics for Opto-Electronic Networks-on-Chip Modeling , 2012, 2012 IEEE/ACM Sixth International Symposium on Networks-on-Chip.

[65]  R. Govindarajan,et al.  Taming warp divergence , 2017, 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).

[66]  William J. Dally,et al.  GPUs and the Future of Parallel Computing , 2011, IEEE Micro.

[67]  Mahmut T. Kandemir,et al.  Phoenix: A Constraint-Aware Scheduler for Heterogeneous Datacenters , 2017, 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS).