Design space exploration for PIM architectures in 3D-stacked memories

Scaling existing architectures to large-scale data-intensive applications is limited by energy and performance losses caused by off-chip memory communication and data movements in the cache hierarchy. Processing-in-Memory (PIM) has been recently revisited to address the issues of memory and power wall, mainly due to the maturity of 3D-stacking manufacturing technology and the increasing demand for bandwidth and parallel access in emerging data-centric applications. Recent studies have shown a wide variety of processing mechanisms to be placed in the logic layer of 3D-stacked memories, not to mention the already available 3D-stacked DRAMs, such as Micron's Hybrid Memory Cube (HMC). Nevertheless, a few studies compare PIM accelerators to each other and have made efforts to indicate the trade-offs between power, area, and performance. In this paper, we review different state-of-the-art 3D-stacked in-memory accelerators, and we analyze them considering important constraints regarding area and power due to critical embedded nature of PIM. Aiming to point in the direction of massive parallel PIM designs, we take the simplest design found in this survey, and we explore the architectural design space to meet the constraints imposed by HMC. Our results show that the most straightforward approach can provide the highest performance while consuming the lowest amount of area and power, which makes it the most suitable design found in this survey for an energy-efficient in-memory accelerator, whether it goes in High-Performance Computing or Embedded Systems. For instance, the outstanding point in the design space indicates that a performance density of 320 GBps/mm2 and a performance efficiency of 0.6 GBps/mW can be achieved in the best scenario, that is, when a massive parallel application reaches the peak bandwidth.

[1]  Jinyoung Lee,et al.  Biscuit: A Framework for Near-Data Processing of Big Data Workloads , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[2]  Tejas Karkhanis,et al.  Active Memory Cube: A processing-in-memory architecture for exascale systems , 2015, IBM J. Res. Dev..

[3]  Christoforos E. Kozyrakis,et al.  TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory , 2017, ASPLOS.

[4]  Sudhakar Yalamanchili,et al.  Lightweight SIMT core designs for intelligent 3D stacked DRAM , 2017, MEMSYS.

[5]  Sudhakar Yalamanchili,et al.  Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[6]  Norbert Wehn,et al.  Retention time measurements and modelling of bit error rates of WIDE I/O DRAM in MPSoCs , 2015, 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[7]  Rainer Buchty,et al.  Data-Centric Computing Frontiers: A Survey On Processing-In-Memory , 2016, MEMSYS.

[8]  Gabriel H. Loh Nuwan Jayasena Mark H. Oskin Mark Nutter Da Ignatowski A Processing-in-Memory Taxonomy and a Case for Studying Fixed-function PIM , 2013 .

[9]  Kiyoung Choi,et al.  A scalable processing-in-memory accelerator for parallel graph processing , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[10]  Jung Ho Ahn,et al.  Accelerating linked-list traversal through near-data processing , 2016, 2016 International Conference on Parallel Architecture and Compilation Techniques (PACT).

[11]  Babak Falsafi,et al.  The mondrian data engine , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[12]  Krishna M. Kavi,et al.  Exploring the Processing-in-Memory design space , 2017, J. Syst. Archit..

[13]  J. Thomas Pawlowski,et al.  Hybrid memory cube (HMC) , 2011, 2011 IEEE Hot Chips 23 Symposium (HCS).

[14]  Seung-Moon Yoo,et al.  FlexRAM: Toward an advanced Intelligent Memory system , 1999, 2012 IEEE 30th International Conference on Computer Design (ICCD).

[15]  Dionysios I. Reisis,et al.  An efficient multiple precision floating-point Multiply-Add Fused unit , 2016, Microelectron. J..

[16]  Li Shen,et al.  A New Architecture For Multiple-Precision Floating-Point Multiply-Add Fused Unit Design , 2007, 18th IEEE Symposium on Computer Arithmetic (ARITH '07).

[17]  Luigi Carro,et al.  NIM: An HMC-Based Machine for Neuron Computation , 2017, ARC.

[18]  Luigi Carro,et al.  Processing in 3D memories to speed up operations on complex data structures , 2018, 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[19]  Gerard J. M. Smit,et al.  Sabrewing: A lightweight architecture for combined floating-point and integer arithmetic , 2012, TACO.

[20]  Feifei Li,et al.  NDC: Analyzing the impact of 3D-stacked memory+logic devices on MapReduce workloads , 2014, 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[21]  J. Jeddeloh,et al.  Hybrid memory cube new DRAM architecture increases density and performance , 2012, 2012 Symposium on VLSI Technology (VLSIT).

[22]  Mingyu Gao,et al.  HRL: Efficient and flexible reconfigurable logic for near-data processing , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[23]  David Blaauw,et al.  Compute Caches , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[24]  R.H. Dennard,et al.  Design Of Ion-implanted MOSFET's with Very Small Physical Dimensions , 1974, Proceedings of the IEEE.

[25]  Jung Ho Ahn,et al.  McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[26]  Mike Ignatowski,et al.  TOP-PIM: throughput-oriented programmable processing in memory , 2014, HPDC '14.

[27]  Franz Franchetti,et al.  Data reorganization in memory using 3D-stacked DRAM , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[28]  Luigi Carro,et al.  Large vector extensions inside the HMC , 2016, 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[29]  Gabriel H. Loh,et al.  Thermal Feasibility of Die-Stacked Processing in Memory , 2014 .

[30]  Luca Benini,et al.  Design and Evaluation of a Processing-in-Memory Architecture for the Smart Memory Cube , 2016, ARCS.

[31]  Franz Franchetti,et al.  A 3D-stacked logic-in-memory accelerator for application-specific data intensive computing , 2013, 2013 IEEE International 3D Systems Integration Conference (3DIC).

[32]  Duncan G. Elliott,et al.  Computational RAM: Implementing Processors in Memory , 1999, IEEE Des. Test Comput..

[33]  Luigi Carro,et al.  Operand size reconfiguration for big data processing in memory , 2017, Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017.

[34]  Jung Ho Ahn,et al.  CACTI-3DD: Architecture-level modeling for 3D die-stacked DRAM main memory , 2012, 2012 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[35]  Christoforos E. Kozyrakis,et al.  A case for intelligent RAM , 1997, IEEE Micro.