Fine-Grained DRAM: Energy-Efficient DRAM for Extreme Bandwidth Systems

Future GPUs and other high-performance throughput processors will require multiple TB/s of bandwidth to DRAM. Satisfying this bandwidth demand within an acceptable energy budget is a challenge in these extreme-bandwidth memory systems. We propose a new high-bandwidth DRAM architecture, Fine-Grained DRAM (FGDRAM), which improves bandwidth by 4× and improves the energy efficiency of DRAM by 2× relative to the highest-bandwidth, most energy-efficient contemporary DRAM, High Bandwidth Memory (HBM2). These benefits are in large measure achieved by partitioning the DRAM die into many independent units, called grains, each of which has a local, adjacent I/O. This approach allows the bandwidth of all the banks in the DRAM to be used simultaneously, eliminating the shared buses that interconnect the banks. Furthermore, the on-DRAM data movement energy is significantly reduced because the wiring distance between the cell array and the local I/O is much shorter. The FGDRAM architecture readily lends itself to existing techniques for reducing the effective DRAM row size in an area-efficient manner, which cuts wasteful row-activation energy in applications with low locality. In addition, when FGDRAM is paired with a memory controller optimized to exploit the additional concurrency provided by the independent grains, it improves GPU system performance by 19% over an iso-bandwidth and iso-capacity future HBM baseline. Thus, this energy-efficient, high-bandwidth FGDRAM architecture addresses the needs of future extreme-bandwidth memory systems.

CCS CONCEPTS • Hardware → Dynamic memory; Power and energy; • Computing methodologies → Graphics processors; • Computer systems organization → Parallel architectures;
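The relationship between the two headline figures can be illustrated with a rough back-of-the-envelope sketch: DRAM power is approximately bandwidth multiplied by energy per bit. The function name and the baseline numbers below (1 TB/s and 4 pJ/bit) are placeholders chosen for illustration and are not taken from the paper; only the 4× bandwidth and 2× energy-efficiency ratios come from the abstract.

```python
def dram_power_watts(bandwidth_tb_per_s: float, energy_pj_per_bit: float) -> float:
    """Approximate DRAM power as (bits transferred per second) x (energy per bit)."""
    bits_per_second = bandwidth_tb_per_s * 1e12 * 8   # TB/s -> bits/s
    joules_per_bit = energy_pj_per_bit * 1e-12        # pJ -> J
    return bits_per_second * joules_per_bit

# Hypothetical baseline values, for illustration only (not from the paper).
BASE_BANDWIDTH_TB_S = 1.0
BASE_ENERGY_PJ_BIT = 4.0

# Ratios stated in the abstract: 4x bandwidth, 2x energy efficiency.
BANDWIDTH_GAIN = 4.0
EFFICIENCY_GAIN = 2.0

baseline = dram_power_watts(BASE_BANDWIDTH_TB_S, BASE_ENERGY_PJ_BIT)

# Scaling bandwidth alone scales power by the same factor.
scaled_same_energy = dram_power_watts(
    BASE_BANDWIDTH_TB_S * BANDWIDTH_GAIN, BASE_ENERGY_PJ_BIT
)

# Halving energy per bit offsets half of that increase.
scaled_better_energy = dram_power_watts(
    BASE_BANDWIDTH_TB_S * BANDWIDTH_GAIN, BASE_ENERGY_PJ_BIT / EFFICIENCY_GAIN
)

print(f"baseline:                  {baseline:.1f} W")
print(f"4x bandwidth, same pJ/bit: {scaled_same_energy:.1f} W")
print(f"4x bandwidth, half pJ/bit: {scaled_better_energy:.1f} W")
```

Under these assumed numbers, quadrupling bandwidth at unchanged energy per bit quadruples DRAM power, while a 2× efficiency gain limits the increase to 2×. This is why the energy-per-bit improvement matters as much as the raw bandwidth gain.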
