PAC: Paged Adaptive Coalescer for 3D-Stacked Memory

Many contemporary data-intensive applications exhibit irregular, highly concurrent memory access patterns that challenge the performance of conventional memory systems. Driven by the growing need for high-bandwidth, low-latency memory, 3D-stacked devices such as the Hybrid Memory Cube (HMC) and High Bandwidth Memory (HBM) were designed to deliver significantly higher throughput than standard JEDEC DDR devices. However, existing memory interfaces and coalescing models, designed for conventional DDR devices, cannot fully exploit the bandwidth potential of these 3D-stacked devices. To close this gap, we introduce a paged adaptive coalescer (PAC) infrastructure with a scalable coalescing network for 3D-stacked memory. We present the design and a simulated implementation of this approach on RISC-V embedded cores with attached HMC devices. Extensive evaluation shows that PAC yields an average coalescing efficiency of 56.01% and reduces bank conflicts and power consumption by 85.16% and 59.21%, respectively. Overall, PAC achieves an average performance gain of 14.35% (and up to 26.06%) across 14 test suites. These results demonstrate the potential of the PAC methodology for architecture design targeting increasingly critical data-intensive algorithms and applications.
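As background on the coalescing idea the abstract summarizes, the sketch below illustrates how outstanding requests that fall within the same memory page might be merged into a single packet before being issued to a 3D-stacked memory link. The 4 KiB page granularity, the Request/PageCoalescer structures, and the flush-on-demand policy are illustrative assumptions for exposition only, not the PAC microarchitecture described in the paper.

```cpp
// Minimal sketch of page-level request coalescing (assumed 4 KiB pages,
// flush-on-demand policy). Illustrates the general concept only.
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

struct Request {
    uint64_t addr;   // byte address of the access
    uint32_t bytes;  // access size in bytes
};

class PageCoalescer {
public:
    explicit PageCoalescer(uint64_t page_bytes = 4096) : page_bytes_(page_bytes) {}

    // Group an incoming request with pending requests to the same page.
    void enqueue(const Request& r) {
        pending_[r.addr / page_bytes_].push_back(r);
    }

    // Issue one merged packet per page; the ratio of packets issued to raw
    // requests received is what a coalescing-efficiency metric would track.
    size_t flush() {
        size_t packets = 0;
        for (const auto& [page, reqs] : pending_) {
            ++packets;
            std::cout << "page " << page << ": merged " << reqs.size()
                      << " requests into one packet\n";
        }
        pending_.clear();
        return packets;
    }

private:
    uint64_t page_bytes_;
    std::unordered_map<uint64_t, std::vector<Request>> pending_;
};

int main() {
    PageCoalescer coalescer;
    // Two requests to the same 4 KiB page and one to a different page.
    coalescer.enqueue({0x1000, 8});
    coalescer.enqueue({0x1040, 8});
    coalescer.enqueue({0x9000, 8});
    coalescer.flush();  // issues 2 packets for 3 requests
    return 0;
}
```

Merging same-page requests in this way reduces the number of packets on the memory link and the number of times a DRAM bank must be activated, which is the intuition behind the bank-conflict and power reductions reported above.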
