HAM: Hotspot-Aware Manager for Improving Communications With 3D-Stacked Memory

Emerging High-Performance Computing (HPC) workloads, such as graph analytics, machine learning, and big data science, are data-intensive. Data-intensive workloads usually exhibit fine-grained memory accesses with little or no data locality, and thus incur frequent cache misses and low utilization of memory bandwidth. 3D-stacked memory devices such as Hybrid Memory Cube (HMC) and High Bandwidth Memory (HBM) can provide significantly higher bandwidth than conventional memory modules. However, the traditional interfaces and optimization methods designed for JEDEC DDR devices cannot fully exploit the potential performance of 3D-stacked memory under the massive number of irregular memory accesses issued by data-intensive applications. In this article, we propose a novel Hotspot-Aware Manager (HAM) infrastructure for 3D-stacked memory devices that optimizes memory access streams via request aggregation, hotspot detection, and in-memory prefetching. We present the HAM design and implementation, and simulate it on a system using RISC-V embedded cores with attached HMC devices. We extensively evaluate HAM with over 12 benchmarks and applications representing diverse irregular memory access patterns. The results show that, on average, HAM reduces redundant requests by 37.51 percent and increases the prefetch buffer hit rate by 4.2 times, compared to a baseline streaming prefetcher. On the selected benchmark set, HAM provides performance gains of 21.81 percent on average (up to 34.28 percent), and power savings of 35.07 percent over a standard 3D-stacked memory.
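The abstract names request aggregation and hotspot detection without describing their mechanics, so the following is only a rough illustration of the general idea, not HAM's actual design. It assumes a 64-byte block granularity, a simple count threshold for marking a block as hot, and hypothetical helper names (`aggregate`, `hotspots`); none of these are taken from the article.

```python
# Illustrative sketch only: block size, threshold, and function names are
# assumptions made for this example, not details from the article.

BLOCK_BYTES = 64  # assumed aggregation granularity


def aggregate(addresses, block_bytes=BLOCK_BYTES):
    """Map fine-grained byte addresses to block base addresses and count
    how many accesses fall into each block (first-seen order is kept)."""
    counts = {}
    for addr in addresses:
        block = (addr // block_bytes) * block_bytes
        counts[block] = counts.get(block, 0) + 1
    return counts


def hotspots(block_counts, threshold):
    """Return the blocks whose access count reaches the assumed threshold."""
    return [block for block, n in block_counts.items() if n >= threshold]


# A tiny synthetic trace of byte addresses with poor spatial regularity.
trace = [0, 8, 16, 4096, 4100, 24, 8192]

counts = aggregate(trace)
# Accesses that would not need their own memory request after aggregation.
redundant = len(trace) - len(counts)
hot = hotspots(counts, threshold=3)

print(counts)     # per-block access counts
print(redundant)  # number of requests merged away
print(hot)        # candidate blocks for prefetching
```

In this toy trace, seven fine-grained accesses collapse into three block requests, and the most frequently touched block is the one a memory-side prefetcher might favor. A real design would also have to bound the aggregation window and the tracking storage, which this sketch ignores.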
