PIMS: a lightweight processing-in-memory accelerator for stencil computations

Stencil computation is a classic computational kernel present in many high-performance scientific applications, such as image processing and partial differential equation (PDE) solvers. A stencil computation sweeps over a multi-dimensional grid and repeatedly updates the value at each point using the values of its neighboring points. Stencil computations often employ datasets that exceed cache capacity, leading to excessive accesses to the memory subsystem; as a result, 3D stencil computations on large grids are memory-bound. In this paper we present PIMS, an in-memory accelerator for stencil computations. PIMS, implemented in the logic layer of a 3D-stacked memory, exploits the high bandwidth provided by through-silicon vias to reduce redundant memory traffic. Our comprehensive evaluation using three different grid sizes across six stencil orders indicates that the proposed architecture reduces data movement by 48.25% on average and reduces bank conflicts by up to 65.55%.
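For reference, the sketch below illustrates the access pattern the abstract describes: one Jacobi sweep of a first-order 7-point 3D stencil in plain C. The kernel shape, grid layout, and coefficients c0/c1 are illustrative assumptions, not details taken from the paper. Each output point reads its six face neighbors, so on grids larger than the cache each sweep re-fetches neighboring planes from memory; this redundant traffic is what an in-memory accelerator like PIMS targets.

```c
#include <stddef.h>

/* Flatten a 3D index (i, j, k) into a row-major 1D offset for an n*n*n grid. */
#define IDX(i, j, k, n) ((size_t)(i) * (n) * (n) + (size_t)(j) * (n) + (size_t)(k))

/* One Jacobi sweep of a first-order 7-point stencil over the interior
 * points of an n*n*n grid. Boundary points are left untouched.
 * c0 and c1 are placeholder coefficients; real kernels derive them
 * from the PDE being solved. */
void stencil_7pt(const double *in, double *out, int n, double c0, double c1)
{
    for (int i = 1; i < n - 1; i++)
        for (int j = 1; j < n - 1; j++)
            for (int k = 1; k < n - 1; k++)
                out[IDX(i, j, k, n)] =
                    c0 * in[IDX(i, j, k, n)] +
                    c1 * (in[IDX(i - 1, j, k, n)] + in[IDX(i + 1, j, k, n)] +
                          in[IDX(i, j - 1, k, n)] + in[IDX(i, j + 1, k, n)] +
                          in[IDX(i, j, k - 1, n)] + in[IDX(i, j, k + 1, n)]);
}
```

Higher-order stencils widen this neighborhood (e.g., reading two or three points along each axis), which increases the reuse distance and, on conventional hierarchies, the volume of redundant off-chip traffic.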
