Prefetching as a Potentially Effective Technique for Hybrid Memory Optimization

The promise that 3D-stacked memory could overcome the memory wall has led to many emerging architectures that integrate 3D-stacked memory into the processor memory hierarchy in a variety of ways, including systems whose main memory combines different memory technologies with different performance and power characteristics. Such memories must be managed so that the system obtains the performance of the fastest memory together with the capacity of the slower but larger memories. Some research in industry and academia has proposed using 3D-stacked DRAM as a hardware-managed cache. More recently, driven in particular by demand for ever larger capacities, researchers have begun exploring the use of multiple memory technologies as a single main memory. The main challenge for such flat-address-space memories is placing and migrating memory pages so as to increase the number of requests serviced from the faster memory while limiting the overhead of page migrations. In this paper we ask a different question: can traditional prefetching be a viable solution for effective management of hybrid memories? We conjecture that tuning a well-known prefetch mechanism for hybrid memories can yield substantial performance improvement. To test this conjecture, we compare the state-of-the-art CAMEO migration policy with a Markov-like prefetcher on a hybrid memory consisting of HBM (3D-stacked DRAM) and Phase Change Memory (PCM), using a set of SPEC CPU2006 and several HPC benchmarks. We find that CAMEO provides a larger performance improvement than prefetching for two-thirds of the workloads (by 59%), while prefetching outperforms CAMEO for the remaining one-third (by 19%). An energy-delay product (EDP) analysis shows that the prefetching solution improves EDP over the no-prefetching baseline, whereas CAMEO does worse in terms of average EDP. These results indicate that prefetching should be reconsidered as a supplementary technique to data migration.
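To make the idea of a Markov-like prefetcher concrete, the following is a minimal sketch and not the design evaluated in the paper. It assumes page-granularity prediction from first-order transition counts, a small LRU-managed fast memory that holds only prefetched pages, and no demand migration; the names `MarkovPrefetcher`, `simulate`, `degree`, and `fast_capacity` are hypothetical.

```python
from collections import Counter, OrderedDict, defaultdict


class MarkovPrefetcher:
    """First-order Markov predictor over page numbers (illustrative only)."""

    def __init__(self, degree=1):
        self.degree = degree                      # successors predicted per access
        self.transitions = defaultdict(Counter)   # page -> Counter of next pages
        self.last_page = None

    def access(self, page):
        """Record the transition last_page -> page and return predicted pages."""
        if self.last_page is not None:
            self.transitions[self.last_page][page] += 1
        self.last_page = page
        successors = self.transitions.get(page)   # .get avoids creating an entry
        if not successors:
            return []
        return [p for p, _ in successors.most_common(self.degree)]


def simulate(trace, fast_capacity=4, degree=1):
    """Count accesses served from fast memory when only prefetches fill it.

    Assumption: demand misses are served from slow memory without migration.
    """
    fast = OrderedDict()                          # LRU order, oldest entry first
    prefetcher = MarkovPrefetcher(degree)
    hits = 0
    for page in trace:
        if page in fast:
            hits += 1
            fast.move_to_end(page)                # refresh LRU position
        for predicted in prefetcher.access(page):
            fast[predicted] = True
            fast.move_to_end(predicted)
            while len(fast) > fast_capacity:
                fast.popitem(last=False)          # evict least recently used page
    return hits


if __name__ == "__main__":
    trace = [1, 2, 3] * 3                         # a simple repeating page pattern
    print(simulate(trace), "of", len(trace), "accesses served from fast memory")
```

On a repeating pattern such as this one, the predictor needs one full pass to learn the transitions before its prefetches begin to place the next page in fast memory ahead of use. Irregular access streams would give the table far less to learn from.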