Lightweight SIMT core designs for intelligent 3D stacked DRAM

In this work we present an analysis of the Harmonica stream multiprocessor, a lightweight, parameterized, open-source single-instruction-multiple-thread (SIMT) core designed for integration within 3D-stacked DRAM. We evaluate the range of Harmonica designs afforded by the architecture's parameter space in the role of a vault-level accelerator, augmenting a design similar to the Micron Hybrid Memory Cube into an array of compact accelerated DRAM channels. In this role, with a small SRAM cache, Harmonica cores provide the small footprint, energy efficiency, latency tolerance, and bandwidth demand needed to perform well. The instruction set and microarchitecture of Harmonica are both novel. Together they provide a lightweight interface for thread creation within the SIMT model, a simple design that issues a single warp per cycle (which simplifies the register file compared to high-performance GPUs), and parameters for attributes ranging from the number of warps and threads per warp to the number of general-purpose registers per thread. For our suite of analytics-oriented benchmarks, Harmonica cores consuming on the order of 100 mW of power sustain an average bandwidth demand of 12 GB/s while tolerating the latency of a DRAM-based memory system.
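
To give a rough sense of what tolerating DRAM latency at this bandwidth implies, the following sketch applies Little's law (data in flight = bandwidth x latency). It is an illustration only, not part of the Harmonica evaluation: the 100 ns latency and 64-byte request size are hypothetical values chosen for the example, 1 GB is taken as 10^9 bytes, and the function names are invented here.

```python
# Back-of-the-envelope latency-tolerance estimate using Little's law.
# ASSUMPTIONS (not from the abstract): 100 ns memory latency, 64-byte requests,
# 1 GB = 10**9 bytes. Only the 12 GB/s figure comes from the text above.

NS_PER_S = 1_000_000_000


def bytes_in_flight(bandwidth_bytes_per_s: int, latency_ns: int) -> int:
    """Bytes that must be outstanding to sustain the bandwidth (rounded up)."""
    total = bandwidth_bytes_per_s * latency_ns
    return -(-total // NS_PER_S)  # integer ceiling division


def outstanding_requests(bandwidth_bytes_per_s: int,
                         latency_ns: int,
                         request_bytes: int) -> int:
    """Number of concurrent memory requests needed, rounded up."""
    in_flight = bytes_in_flight(bandwidth_bytes_per_s, latency_ns)
    return -(-in_flight // request_bytes)


if __name__ == "__main__":
    bandwidth = 12 * 10**9   # 12 GB/s average demand, from the abstract
    latency_ns = 100         # hypothetical DRAM access latency
    request_bytes = 64       # hypothetical request size
    print(bytes_in_flight(bandwidth, latency_ns))                      # 1200
    print(outstanding_requests(bandwidth, latency_ns, request_bytes))  # 19
```

Under these assumed numbers, a core would need roughly 19 requests outstanding at once, which is the kind of concurrency that multiple warps and threads per warp are meant to supply.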
