Fulcrum: A Simplified Control and Access Mechanism Toward Flexible and Practical In-Situ Accelerators

In-situ approaches process data very close to the memory cells, in the row buffer of each subarray. This minimizes data movement costs and affords parallelism across subarrays. However, current in-situ approaches are limited to row-wide bitwise (or few-bit) operations applied uniformly across the row buffer. They impose a significant overhead of multiple row activations to emulate 32-bit additions and multiplications using bitwise operations, and they cannot support operations with data dependencies or operations based on predicates. Moreover, with current peripheral logic, communication among subarrays is inefficient, and with typical data layouts, bits in a word are not physically adjacent. The key insight of this work is that in-situ, single-word ALUs outperform in-situ, parallel, row-wide, bitwise ALUs by reducing the number of row activations and enabling new operations and optimizations. Our proposed lightweight access and control mechanism, Fulcrum, sequentially feeds data into the single-word ALU and enables operations with data dependencies and operations based on a predicate. For algorithms that require communication among subarrays, we augment the peripheral logic with broadcasting capabilities and a previously-proposed method for low-cost inter-subarray data movement. The sequential processor also enables overlapping of broadcasting and computation, and reuniting bits of a word that are not physically adjacent. In order to realize true subarray-level parallelism, we introduce a lightweight column-selection mechanism through shifting one-hot encoded values. This technique enables independent column selection in each subarray. We integrate Fulcrum with Compute Express Link (CXL), a new interconnect standard.
Fulcrum with one memory stack delivers on average (up to) (i) 23.4 (76) times speedup over a server-class GPU, NVIDIA P100, with three stacks of HBM2 memory, (ii) 70 (228) times speedup per memory stack over the GPU, and (iii) 19 (178.9) times speedup per memory stack over an ideal model of the GPU, which only accounts for the overhead of data movement.
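As an illustration of the access pattern described above, the following is a minimal software sketch, not the hardware design itself. It assumes a row buffer can be modeled as a list of words, that column selection is a one-hot mask advanced by a shift, and that a single-word ALU consumes one word per step. The names `OneHotColumnSelector` and `predicated_sum` are hypothetical and introduced only for this example.

```python
from typing import Callable, List


class OneHotColumnSelector:
    """Hypothetical model: a one-hot mask marks the selected column,
    and a shift moves the selection to the next column."""

    def __init__(self, num_columns: int, start: int = 0) -> None:
        if not 0 <= start < num_columns:
            raise ValueError("start must be a valid column index")
        self.num_columns = num_columns
        self.mask = 1 << start  # exactly one bit set

    @property
    def column(self) -> int:
        # Index of the single set bit in the one-hot mask.
        return self.mask.bit_length() - 1

    def shift(self) -> bool:
        """Advance by one column; return False if already at the last column."""
        if self.column + 1 >= self.num_columns:
            return False
        self.mask <<= 1
        return True


def predicated_sum(row_buffer: List[int], predicate: Callable[[int], bool]) -> int:
    """Feed words one at a time to a single-word ALU model and
    accumulate only the words for which the predicate holds."""
    if not row_buffer:
        return 0
    selector = OneHotColumnSelector(len(row_buffer))  # one selector per subarray
    acc = 0
    while True:
        word = row_buffer[selector.column]
        if predicate(word):  # operation based on a predicate
            acc += word      # loop-carried dependency on the accumulator
        if not selector.shift():
            break
    return acc
```

Because each modeled subarray owns its selector, two subarrays can sit at different columns at the same time. For example, `predicated_sum([3, 8, 1, 12], lambda w: w > 2)` accumulates 3, 8, and 12, while a second call on another list with a different predicate proceeds independently of the first.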
