Reconfigurable dataflow graphs for processing-in-memory

To address the ever-widening gap between processor clock speeds and memory access times, there has been growing interest in moving computation closer to memory. Near-data processing, or processing-in-memory (PIM), is particularly well suited to very high-bandwidth memories such as 3D-stacked DRAMs. Several designs have been proposed for PIM, including simple in-order processors, GPUs, specialized ASICs, and reconfigurable logic. In our case, we use coarse-grained reconfigurable logic to build dataflow graphs for computational kernels in memory, a design we call the dataflow PIM (DFPIM). We show that this approach achieves significant speedups and reduces the energy consumed by computation. We evaluated our designs using several process technologies for the coarse-grained logic units. The DFPIM concept showed good performance improvement and excellent energy efficiency for the streaming benchmarks analyzed. A DFPIM in a 28 nm process, with an instance in each of the 16 vaults of a 3D-DRAM logic layer, achieved an average speed-up of 7.2 over 32 cores of an Intel Xeon server system, and the server processor required 368 times more energy than the DFPIM implementation to execute the benchmarks.
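The abstract does not describe a programming interface, but a minimal sketch may help illustrate the underlying idea: a streaming kernel is expressed as a static dataflow graph whose nodes fire as soon as all of their input operands (tokens) are available. The node names, the software scheduler, and the example kernel y[i] = a*x[i] + b below are illustrative assumptions, not the authors' toolchain; in a DFPIM such nodes would be mapped onto coarse-grained functional units in the 3D-DRAM logic layer rather than interpreted in software.

```python
from collections import deque

class Node:
    """One coarse-grained operation in a static dataflow graph (illustrative sketch)."""
    def __init__(self, name, op, num_inputs):
        self.name = name
        self.op = op                                   # applied when the node fires
        self.operands = [deque() for _ in range(num_inputs)]
        self.successors = []                           # (node, input_port) pairs

    def ready(self):
        # Dataflow firing rule: enabled when every input port holds a token.
        return all(self.operands)

    def fire(self):
        args = [q.popleft() for q in self.operands]
        result = self.op(*args)
        for node, port in self.successors:
            node.operands[port].append(result)         # forward the result token
        return result

# Graph for the streaming kernel y[i] = a * x[i] + b (hypothetical example).
mul = Node("mul", lambda a, x: a * x, 2)
add = Node("add", lambda p, b: p + b, 2)
mul.successors.append((add, 0))                        # mul output feeds add's port 0

def run(xs, a=3, b=1):
    out = []
    for x in xs:
        mul.operands[0].append(a)                      # loop-invariant constants become
        mul.operands[1].append(x)                      # tokens, just like stream data
        add.operands[1].append(b)
        if mul.ready():
            mul.fire()
        if add.ready():
            out.append(add.fire())                     # add is the graph's sink
    return out

print(run([1, 2, 3]))   # -> [4, 7, 10]
```

Because firing depends only on token availability, independent iterations of such a kernel can proceed concurrently when the graph is replicated or pipelined across vaults, which is the property the streaming benchmarks exploit.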
