Selective DRAM cache bypassing for improving bandwidth on DRAM/NVM hybrid main memory systems

Satisfying the demand for higher memory capacity is a major problem for computing systems. Conventional solutions are reaching their limits, so DRAM/NVM hybrid main memory systems have been proposed, combining emerging Non-Volatile Memory (NVM) for large capacity with a DRAM last-level cache for high access speed. In these systems, however, the two device types share a limited number of memory channels/ranks, and NVM channels/ranks are often less utilized than DRAM ones. This paper proposes OBYST (On hit BYpass to STeal bandwidth), a technique that improves memory bandwidth by selectively sending read requests that hit in the DRAM cache to NVM instead of to a busy DRAM. We also propose an inter-device request scheduling policy optimized for OBYST. With negligible area overhead, OBYST improves bandwidth, IPC, and EDP by up to 22%, 21%, and 26%, respectively, over a baseline without bandwidth optimizations.
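The bypass decision described above can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not state how "busy" is detected or which hits are eligible, so the queue-depth threshold, the `nvm_queue_len < dram_queue_len` slack test, the clean-line restriction, and all names (`ReadRequest`, `route_read`, `busy_threshold`) are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class ReadRequest:
    address: int
    dram_cache_hit: bool  # the line is present in the DRAM cache
    line_dirty: bool      # assumed flag: DRAM copy is newer than the NVM copy


def route_read(req: ReadRequest, dram_queue_len: int, nvm_queue_len: int,
               busy_threshold: int = 8) -> str:
    """Pick the device that serves a read: 'DRAM' or 'NVM'.

    Hypothetical policy: a DRAM-cache hit is bypassed to NVM only when
    the DRAM queue looks busy and the NVM queue is shorter.
    """
    if not req.dram_cache_hit:
        # A miss has to be served from NVM regardless of load.
        return "NVM"
    if req.line_dirty:
        # Assumption: a dirty line is stale in NVM, so it cannot be bypassed.
        return "DRAM"
    dram_busy = dram_queue_len >= busy_threshold
    nvm_has_slack = nvm_queue_len < dram_queue_len
    if dram_busy and nvm_has_slack:
        # Steal otherwise idle NVM bandwidth for this hit.
        return "NVM"
    return "DRAM"


if __name__ == "__main__":
    hit = ReadRequest(address=0x40, dram_cache_hit=True, line_dirty=False)
    print(route_read(hit, dram_queue_len=12, nvm_queue_len=2))
    print(route_read(hit, dram_queue_len=3, nvm_queue_len=2))
```

A real memory controller would likely also weigh the longer NVM read latency and bank-level state rather than queue lengths alone; the sketch only shows the shape of the decision.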
