Analytical Study on Bandwidth Efficiency of Heterogeneous Memory Systems

Heterogeneous memory systems integrate different memory technologies to balance design requirements such as bandwidth, capacity, and cost. The performance of these systems depends heavily on memory hierarchy organization, memory attributes, and application characteristics. In this paper, we present analytical bandwidth models for a range of heterogeneous memory systems composed of DRAM and non-volatile memory (NVM). Our models enable the exploration of heterogeneous memory systems with different organizations and attributes. Using the models, we study the bandwidth efficiency of heterogeneous memory systems to provide insights into their bandwidth bottlenecks under different application characteristics. Our analytical results highlight the importance of NVM read-write bandwidth asymmetry and DRAM-NVM bandwidth asymmetry in bandwidth efficiency. Specifically, in flat non-uniform memory access (NUMA) systems, the read bandwidth is maximized when a specific portion of the bandwidth is delivered by DRAM; that portion depends on multiple factors, including DRAM and NVM bandwidth attributes and application bandwidth characteristics. In DRAM-cache-based systems, when the hit rate is low, the DRAM cache organization has minimal impact on the read bandwidth. At higher hit rates and NVM bandwidths, however, the impact of the cache organization on sustained read bandwidth becomes pronounced.
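
The flat NUMA observation above can be illustrated with a deliberately simplified sketch. It is not the model developed in the paper: it assumes read-only traffic, DRAM and NVM serving requests fully in parallel, and a fixed fraction of the traffic going to DRAM. It ignores the NVM read-write asymmetry and the application characteristics that the abstract names as additional factors. The function names and the example bandwidth values are hypothetical.

```python
def flat_numa_read_bandwidth(dram_fraction, dram_bw, nvm_bw):
    """Sustained read bandwidth of a simplified flat DRAM+NVM system.

    Assumption (not from the paper): a fraction `dram_fraction` of all read
    traffic is served by DRAM, the rest by NVM, and both memories work in
    parallel. Each memory caps the total at its own bandwidth divided by
    its share of the traffic; the tighter cap wins.
    """
    if not 0.0 <= dram_fraction <= 1.0:
        raise ValueError("dram_fraction must be in [0, 1]")
    limits = []
    if dram_fraction > 0.0:
        limits.append(dram_bw / dram_fraction)        # cap imposed by DRAM
    if dram_fraction < 1.0:
        limits.append(nvm_bw / (1.0 - dram_fraction))  # cap imposed by NVM
    return min(limits)


def best_dram_fraction(dram_bw, nvm_bw):
    """DRAM share at which both caps coincide under the same assumptions."""
    return dram_bw / (dram_bw + nvm_bw)


if __name__ == "__main__":
    dram_bw, nvm_bw = 80.0, 20.0  # hypothetical values, e.g. in GB/s
    for fraction in (0.5, best_dram_fraction(dram_bw, nvm_bw), 1.0):
        bw = flat_numa_read_bandwidth(fraction, dram_bw, nvm_bw)
        print(f"DRAM share {fraction:.2f}: sustained read bandwidth {bw:.1f}")
```

Under these assumptions, sending all traffic to DRAM leaves the NVM bandwidth unused, while sending too little to DRAM makes NVM the bottleneck; the balanced share lets the two bandwidths add up.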
