A performance & power comparison of modern high-speed DRAM architectures

To feed the high degrees of parallelism in modern graphics processors and manycore CPU designs, DRAM manufacturers have created new DRAM architectures that deliver high bandwidth. This paper presents a simulation-based study of the most common forms of DRAM today: DDR3, DDR4, and LPDDR4 SDRAM; GDDR5 SGRAM; and two recent 3D-stacked architectures, High Bandwidth Memory (HBM1, HBM2) and Hybrid Memory Cube (HMC1, HMC2). Our simulations give both time and power/energy results and reveal four things: (a) current multi-channel DRAM technologies have succeeded in translating bandwidth into better execution time for all applications, turning memory-bound applications into compute-bound ones; (b) the inherent parallelism in the memory system is the critical enabling factor (high bandwidth alone is insufficient); (c) while all current DRAM architectures have addressed the memory-bandwidth problem, the memory-latency problem remains, dominated by queuing delays arising from lack of parallelism; and (d) the majority of power and energy is spent in the I/O interface, driving bits across the bus. DRAM-specific overhead beyond bandwidth has been reduced significantly, which is good news: an ideal memory technology would dissipate power only in bandwidth, and all else would be free.
