HBM-Resident Prefetching for Heterogeneous Memory System

To meet increasing demands for very large memory capacities, high bandwidth, and energy efficiency, researchers are exploring heterogeneous memory systems that combine faster 3D-DRAMs, DDRx DRAM, and non-volatile memories (NVMs). In this paper we evaluate prefetching in a flat-addressable heterogeneous memory comprising High Bandwidth Memory (HBM) and phase change memory (PCM). We find that large prefetch buffers (64 MB) outperform smaller ones (2 MB), but buffers of that size are not feasible on the processor die. We therefore evaluate an HBM-resident prefetch buffer that provides larger capacity and exploits HBM's higher memory bandwidth, and we present new prefetching policies that accommodate the differences in data path compared to traditional prefetchers. We show that reserving a small fraction (1/16th) of HBM capacity to host a hardware prefetch buffer improves IPC for a set of SPEC CPU2006 and HPC benchmarks by an average of 34% and a maximum of 98% over a baseline system with no prefetching. Prefetching also reduces total PCM traffic by 10% on average, shifting more memory traffic to the faster HBM and thereby improving overall performance. We find that this prefetching scheme outperforms the CAMEO and Alloy Cache schemes by an average of 60% and 10%, respectively.
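To make the buffer organization concrete, the sketch below models a minimal, direct-mapped prefetch buffer in which 1/16th of the HBM capacity is reserved to stage PCM cache lines. The names (HbmPrefetchBuffer, lookup, on_miss), the assumed 1 GB HBM size, and the simple next-N-line fill policy are illustrative assumptions for this sketch, not details taken from the paper; the paper's actual policies differ to account for the HBM-resident data path.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

constexpr uint64_t kLineBytes   = 64;                 // cache-line size
constexpr uint64_t kHbmBytes    = 1ull << 30;         // assumed 1 GB of HBM
constexpr uint64_t kBufferBytes = kHbmBytes / 16;     // reserve 1/16th (64 MB)
constexpr uint64_t kBufferLines = kBufferBytes / kLineBytes;

// Direct-mapped buffer of PCM lines staged in a reserved HBM region.
// Only the tags are modeled here; the line data itself would live in HBM.
class HbmPrefetchBuffer {
 public:
  // True if the requested PCM line is already staged in the HBM buffer,
  // i.e., the demand read can be served from HBM instead of PCM.
  bool lookup(uint64_t pcm_addr) const {
    uint64_t line = pcm_addr / kLineBytes;
    uint64_t slot = line % kBufferLines;              // direct-mapped index
    return valid_[slot] && tag_[slot] == line;
  }

  // On a demand miss, stage the missing line and a few sequential
  // successors (simple next-N-line policy, purely for illustration).
  void on_miss(uint64_t pcm_addr, int degree = 4) {
    uint64_t line = pcm_addr / kLineBytes;
    for (int i = 0; i < degree; ++i) insert(line + i);
  }

 private:
  void insert(uint64_t line) {
    uint64_t slot = line % kBufferLines;
    valid_[slot] = true;                              // evicts any resident line
    tag_[slot] = line;
  }

  std::vector<bool> valid_ = std::vector<bool>(kBufferLines, false);
  std::vector<uint64_t> tag_ = std::vector<uint64_t>(kBufferLines, 0);
};

int main() {
  HbmPrefetchBuffer buf;
  uint64_t addr = 0x1234000;                          // some PCM address
  std::cout << "hit before miss handling: " << buf.lookup(addr) << "\n";
  buf.on_miss(addr);                                  // stage line + successors
  std::cout << "hit on the next line:     " << buf.lookup(addr + kLineBytes) << "\n";
  return 0;
}
```

In this sketch a direct-mapped organization keeps the tag check cheap, since every demand miss to PCM must probe the buffer before the slower PCM access is issued; the paper's policies and replacement decisions are richer than this illustration.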

[1] Aamer Jaleel, et al. CAMEO: A Two-Level Memory Organization with Capacity of Main Memory and Flexibility of Hardware-Managed Cache, 2014, 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[2] Onur Mutlu, et al. Ramulator: A Fast and Extensible DRAM Simulator, 2016, IEEE Computer Architecture Letters.

[3] Onur Mutlu, et al. Efficient Data Mapping and Buffering Techniques for Multilevel Cell Phase-Change Memories, 2014, ACM Trans. Archit. Code Optim.

[4] Krishna M. Kavi, et al. Prefetching as a Potentially Effective Technique for Hybrid Memory Optimization, 2016, MEMSYS.

[5] Douglas J. Joseph, et al. Prefetching Using Markov Predictors, 1997, 24th Annual International Symposium on Computer Architecture (ISCA).

[6] Onur Mutlu, et al. Memory scaling: A systems architecture perspective, 2013, 5th IEEE International Memory Workshop.

[7] Hao Wang, et al. DUANG: Fast and lightweight page migration in asymmetric memory systems, 2016, IEEE International Symposium on High Performance Computer Architecture (HPCA).

[8] Kiyoung Choi, et al. Low-Power Hybrid Memory Cubes With Link Power Management and Two-Level Prefetching, 2016, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[9] Mark Oskin, et al. A Software-Managed Approach to Die-Stacked DRAM, 2015, International Conference on Parallel Architecture and Compilation (PACT).

[10] Sudhanva Gurumurthi, et al. Phase Change Memory: From Devices to Systems, 2011, Phase Change Memory.

[11] Bronis R. de Supinski, et al. HpMC: An Energy-aware Management System of Multi-level Memory Architectures, 2015, MEMSYS.

[12] Nathan Beckmann, et al. Jigsaw: Scalable software-defined caches, 2013, 22nd International Conference on Parallel Architectures and Compilation Techniques (PACT).

[13] Janak H. Patel, et al. Stride directed prefetching in scalar processors, 1992, MICRO.

[14] David Roberts, et al. Heterogeneous memory architectures: A HW/SW approach for mixing die-stacked and off-package memories, 2015, IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[15] Krishna M. Kavi, et al. Moola: Multicore Cache Simulator, 2015.

[16] Anand Sivasubramaniam, et al. Going the distance for TLB prefetching: an application-driven study, 2002, ISCA.

[17] Joonyoung Kim, et al. HBM: Memory solution for bandwidth-hungry processors, 2014, IEEE Hot Chips 26 Symposium (HCS).

[18] James E. Smith, et al. Data Cache Prefetching Using a Global History Buffer, 2004, 10th International Symposium on High Performance Computer Architecture (HPCA).

[19] Moinuddin K. Qureshi, et al. Reducing read latency of phase change memory via early read and Turbo Read, 2015, IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[20] Norman P. Jouppi, et al. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers, 1990, 17th Annual International Symposium on Computer Architecture (ISCA).

[21] Mahmut T. Kandemir, et al. Meeting midway: Improving CMP performance with memory-side prefetching, 2013, 22nd International Conference on Parallel Architectures and Compilation Techniques (PACT).

[22] Moinuddin K. Qureshi, et al. Scalable high performance main memory system using phase-change memory technology, 2009, ISCA.

[23] Yan Solihin, et al. CHOP: Adaptive filter-based DRAM caching for CMP server platforms, 2010, 16th International Symposium on High-Performance Computer Architecture (HPCA).

[24] Gabriel H. Loh, et al. Fundamental Latency Trade-off in Architecting DRAM Caches: Outperforming Impractical SRAM-Tags with a Simple and Practical Design, 2012, 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[25] Rachata Ausavarungnirun, et al. Row buffer locality aware caching policies for hybrid memories, 2012, IEEE 30th International Conference on Computer Design (ICCD).

[26] Alaa R. Alameldeen, et al. Transparent Hardware Management of Stacked DRAM as Part of Memory, 2014, 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[27] Babak Falsafi, et al. Unison Cache: A Scalable and Effective Die-Stacked DRAM Cache, 2014, 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).