NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules

Energy consumed for transferring data across the processor memory hierarchy constitutes a large fraction of total system energy consumption, and this fraction has steadily increased with technology scaling. In this paper, we propose near-DRAM acceleration (NDA) architectures, which process data using accelerators 3D-stacked on DRAM devices comprising off-chip main memory modules. NDA transfers most data through high-bandwidth and low-energy 3D interconnects between accelerators and DRAM devices instead of low-bandwidth and high-energy off-chip interconnects between a processor and DRAM devices, substantially reducing energy consumption and improving performance. Unlike previous near-memory processing architectures, NDA is built upon commodity DRAM devices; apart from inserting through-silicon vias (TSVs) to 3D-interconnect DRAM devices and accelerators, NDA requires minimal changes to the commodity DRAM device and standard memory module architectures. This allows NDA to be more easily adopted in both existing and emerging systems. Our experiments demonstrate that, on average, our NDA-based system consumes 46% (68%) lower (data transfer) energy at 1.67× higher performance than a system that integrates the same accelerator logic within the processor itself.

[1]  Duncan G. Elliott,et al.  Computational Ram: A Memory-simd Hybrid And Its Application To Dsp , 1992, 1992 Proceedings of the IEEE Custom Integrated Circuits Conference.

[2]  Michael F. Deering,et al.  FBRAM: a new form of memory optimized for 3D graphics , 1994, SIGGRAPH.

[3]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[4]  John Wawrzynek,et al.  Garp: a MIPS processor with a reconfigurable coprocessor , 1997, Proceedings. The 5th Annual IEEE Symposium on Field-Programmable Custom Computing Machines Cat. No.97TB100186).

[5]  Noah Treuhaft,et al.  Intelligent RAM (IRAM): the industrial setting, applications, and architectures , 1997, Proceedings International Conference on Computer Design VLSI in Computers and Processors.

[6]  Christoforos E. Kozyrakis,et al.  A case for intelligent RAM , 1997, IEEE Micro.

[7]  Frederic T. Chong,et al.  Active pages: a computation model for intelligent memory , 1998, ISCA.

[8]  Andreas Moshovos,et al.  CHIMAERA: a high-performance architecture with a tightly-coupled reconfigurable functional unit , 2000, ISCA '00.

[9]  William J. Dally,et al.  Smart Memories: a modular reconfigurable architecture , 2000, ISCA '00.

[10]  John Wawrzynek,et al.  The Garp Architecture and C Compiler , 2000, Computer.

[11]  Alvin R. Lebeck,et al.  Power aware page allocation , 2000, SIGP.

[12]  Reiner W. Hartenstein,et al.  A decade of reconfigurable computing: a visionary retrospective , 2001, Proceedings Design, Automation and Test in Europe. Conference and Exhibition 2001.

[13]  Reiner W. Hartenstein Coarse grain reconfigurable architectures , 2001, Proceedings of the ASP-DAC 2001. Asia and South Pacific Design Automation Conference 2001 (Cat. No.01EX455).

[14]  Chun Chen,et al.  The architecture of the DIVA processing-in-memory chip , 2002, ICS '02.

[15]  Scott Hauck,et al.  Reconfigurable computing: a survey of systems and software , 2002, CSUR.

[16]  A. Tsai,et al.  PipeRench: A virtualized programmable datapath in 0.18 micron technology , 2002, Proceedings of the IEEE 2002 Custom Integrated Circuits Conference (Cat. No.02CH37285).

[17]  Rudy Lauwereins,et al.  ADRES: An Architecture with Tightly Coupled VLIW Processor and Coarse-Grained Reconfigurable Matrix , 2003, FPL.

[18]  Won-Jong Lee,et al.  A Scalable GPU Architecture based on Dynamically Reconfigurable Embedded Processor , 2005 .

[19]  Josep Torrellas,et al.  A Near-Memory Processor for Vector, Streaming and Bit Manipulation Workloads , 2005 .

[20]  William J. Dally,et al.  Scatter-add in data parallel architectures , 2005, 11th International Symposium on High-Performance Computer Architecture.

[21]  Seth Copen Goldstein,et al.  Tartan: evaluating spatial computation for whole program execution , 2006, ASPLOS XII.

[22]  Soo-In Cho,et al.  A 512-Mb DDR3 SDRAM prototype with CIO minimization and self-calibration techniques , 2006, VLSIC 2006.

[23]  A Survey of Multi-Core Coarse-Grained Reconfigurable Arrays for Embedded Applications , 2007 .

[24]  Gabriel H. Loh,et al.  3D-Stacked Memory Architectures for Multi-core Processors , 2008, 2008 International Symposium on Computer Architecture.

[25]  Geoffrey C. Fox,et al.  MapReduce for Data Intensive Scientific Analyses , 2008, 2008 IEEE Fourth International Conference on eScience.

[26]  Hsien-Hsin S. Lee,et al.  POD: A 3D-Integrated Broad-Purpose Acceleration Layer , 2008, IEEE Micro.

[27]  Serge J. Belongie,et al.  SD-VBS: The San Diego Vision Benchmark Suite , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[28]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[29]  So-Ra Kim,et al.  8Gb 3D DDR3 DRAM using through-silicon-via technology , 2009, 2009 IEEE International Solid-State Circuits Conference - Digest of Technical Papers.

[30]  G. Edward Suh,et al.  Flexible and Efficient Instruction-Grained Run-Time Monitoring Using On-Chip Reconfigurable Fabric , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[31]  Young-Hyun Jun,et al.  8 Gb 3-D DDR3 DRAM Using Through-Silicon-Via Technology , 2009, IEEE Journal of Solid-State Circuits.

[32]  David H. Albonesi,et al.  ReMAP: A Reconfigurable Heterogeneous Multicore Architecture , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[33]  Christoforos E. Kozyrakis,et al.  Understanding sources of inefficiency in general-purpose chips , 2010, ISCA.

[34]  Bradford M. Beckmann,et al.  The gem5 simulator , 2011, CARN.

[35]  Zhen Fang,et al.  Active memory controller , 2012, The Journal of Supercomputing.

[36]  Karthikeyan Sankaralingam,et al.  Dynamically Specialized Datapaths for energy efficient computing , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[37]  Young-Hyun Jun,et al.  A 1.2V 12.8GB/s 2Gb mobile Wide-I/O DRAM with 4×128 I/Os using TSV-based stacking , 2011, 2011 IEEE International Solid-State Circuits Conference.

[38]  Tomonori Sekiguchi,et al.  1-Tbyte/s 1-Gbit DRAM Architecture Using 3-D Interconnect for High-Throughput Computing , 2011, IEEE Journal of Solid-State Circuits.

[39]  J. Thomas Pawlowski,et al.  Hybrid memory cube (HMC) , 2011, 2011 IEEE Hot Chips 23 Symposium (HCS).

[40]  Michael C. Huang,et al.  Efficient data streaming with on-chip accelerators: Opportunities and challenges , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[41]  William J. Dally,et al.  GPUs and the Future of Parallel Computing , 2011, IEEE Micro.

[42]  Engin Ipek,et al.  A resistive TCAM accelerator for data-intensive computing , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[43]  Pankaj Shailendra Gode,et al.  Function inlining and loop unrolling for loop acceleration in reconfigurable processors , 2012, CASES '12.

[44]  Hsien-Hsin S. Lee,et al.  3D-MAPS: 3D Massively parallel processor with stacked memory , 2012, 2012 IEEE International Solid-State Circuits Conference.

[45]  Seung-Moon Yoo,et al.  FlexRAM: Toward an advanced Intelligent Memory system , 1999, 2012 IEEE 30th International Conference on Computer Design (ICCD).

[46]  Jong-Ho Kang,et al.  A 1.2V 23nm 6F2 4Gb DDR3 SDRAM with local-bitline sense amplifier, hybrid LIO sense amplifier and dummy-less array architecture , 2012, 2012 IEEE International Solid-State Circuits Conference.

[47]  David Blaauw,et al.  Centip3De: A 64-Core, 3D Stacked Near-Threshold System , 2012, IEEE Micro.

[48]  Wen-mei W. Hwu,et al.  Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing , 2012 .

[49]  Karthikeyan Sankaralingam,et al.  DySER: Unifying Functionality and Parallelism Specialization for Energy-Efficient Computing , 2012, IEEE Micro.

[50]  Young Choi,et al.  A 1.2V 30nm 3.2Gb/s/pin 4Gb DDR4 SDRAM with dual-error detection and PVT-tolerant data-fetch scheme , 2012, 2012 IEEE International Solid-State Circuits Conference.

[51]  Mario Konijnenburg,et al.  ULP-SRP: Ultra low power Samsung Reconfigurable Processor for biomedical applications , 2012, 2012 International Conference on Field-Programmable Technology.

[52]  Young-Hyun Jun,et al.  A 1.2 V 12.8 GB/s 2 Gb Mobile Wide-I/O DRAM With 4 $\times$ 128 I/Os Using TSV Based Stacking , 2011, IEEE Journal of Solid-State Circuits.

[53]  Paolo Ienne,et al.  Elastic CGRAs , 2013, FPGA '13.

[54]  Gabriel H. Loh Nuwan Jayasena Mark H. Oskin Mark Nutter Da Ignatowski A Processing-in-Memory Taxonomy and a Case for Studying Fixed-function PIM , 2013 .

[55]  Michael M. Swift,et al.  Efficient virtual memory for big memory servers , 2013, ISCA.

[56]  José F. Martínez,et al.  Understanding and mitigating refresh overheads in high-density DDR4 DRAM systems , 2013, ISCA.

[57]  Jun-Seok Park,et al.  A 1.2 V 30 nm 3.2 Gb/s/pin 4 Gb DDR4 SDRAM With Dual-Error Detection and PVT-Tolerant Data-Fetch Scheme , 2012, IEEE Journal of Solid-State Circuits.

[58]  Shekhar Borkar,et al.  Role of Interconnects in the Future of Computing , 2013, Journal of Lightwave Technology.

[59]  O Seongil,et al.  Reducing memory access latency with asymmetric DRAM bank organizations , 2013, ISCA.

[60]  Franz Franchetti,et al.  A 3D-stacked logic-in-memory accelerator for application-specific data intensive computing , 2013, 2013 IEEE International 3D Systems Integration Conference (3DIC).

[61]  Antonia Zhai,et al.  Triggered instructions: a control paradigm for spatially-programmed architectures , 2013, ISCA.

[62]  Karthikeyan Sankaralingam,et al.  A general constraint-centric scheduling framework for spatial architectures , 2013, PLDI.

[63]  Mike Ignatowski,et al.  High-level Programming Model Abstractions for Processing in Memory , 2013 .

[64]  Eby G. Friedman,et al.  AC-DIMM: associative computing with STT-MRAM , 2013, ISCA.

[65]  Jung Ho Ahn,et al.  The McPAT Framework for Multicore and Manycore Architectures: Simultaneously Modeling Power, Area, and Timing , 2013, TACO.

[66]  Ming Yang,et al.  Sonic Millip3De: A massively parallel 3D-stacked accelerator for 3D ultrasound , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).

[67]  Rajeev Balasubramonian,et al.  Quantifying the relationship between the power delivery network and architectural policies in a 3D-stacked memory device , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[68]  O Seongil,et al.  Row-buffer decoupling: A case for low-latency DRAM microarchitecture , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[69]  Yoav Etsion,et al.  Single-graph multiple flows: Energy efficient design alternative for GPGPUs , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[70]  Mike Ignatowski,et al.  TOP-PIM: throughput-oriented programmable processing in memory , 2014, HPDC '14.

[71]  Nam Sung Kim,et al.  Energy-efficient reconfigurable cache architectures for accelerator-enabled embedded systems , 2014, 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[72]  Feifei Li,et al.  Comparing Implementations of Near-Data Computing with In-Memory MapReduce Workloads , 2014, IEEE Micro.

[73]  Jung Ho Ahn,et al.  DRAMA: An Architecture for Accelerated Processing Near Memory , 2015, IEEE Computer Architecture Letters.