3D-Stacked Memory-Side Acceleration: Accelerator and System Design

Specialized hardware acceleration is an effective technique for mitigating the dark silicon problem. A key challenge in designing on-chip hardware accelerators for data-intensive applications is transferring data efficiently between the memory hierarchy and the accelerators. Although the Processing-in-Memory (PIM) technique has the potential to reduce the overhead of data transfers, it has been limited by traditional process technology. Recent process technology advancements such as 3D die stacking enable efficient PIM architectures by integrating accelerators into the logic layer of 3D DRAM, leading to the concept of the 3D-stacked Memory-Side Accelerator (MSA). In this paper, we first present the overall architecture of the 3D-stacked MSA, which relies on a configurable array of domain-specific accelerators. We then describe a full-system prototype built upon a novel software stack and a hybrid evaluation methodology. Experimental results demonstrate that the 3D-stacked MSA achieves up to 179x and 96x better energy efficiency than the Intel Haswell processor for the FFT and matrix transposition algorithms, respectively.
