论文信息 - MAPIM: Mat Parallelism for High Performance Processing in Non-volatile Memory Architecture

MAPIM: Mat Parallelism for High Performance Processing in Non-volatile Memory Architecture

In the Internet of Things (IoT) era, data movement between processing units and memory is a critical factor in the overall system performance. Processing-in-Memory (PIM) is a promising solution to address this bandwidth bottleneck by performing a portion of computation inside the memory. Many prior studies have enabled various PIM operations on nonvolatile memory (NVM) by modifying sense amplifiers (SA). They exploit a single sense amplifier to handle multiple bitlines with a multiplexer (MUX) since a single SA circuit takes much larger area than an NVM 1-bit cell. This limits potential parallelism that the PIM techniques can ideally achieve. In this paper, we propose MAPIM, mat parallelism for high-performance processing in non-volatile memory architecture. Our design carries out multiple bit-lines (BLs) requests under a MUX in parallel with two novel design components, multi-column/row latch (MCRL) and shared SA routing (SSR). The MCRL allows the address decoder to activate multiple addresses in both column and row directions by buffering the consecutively-requested addresses. The activated bits are simultaneously sensed by the multiple SAs across a MUX based on the SSR technique. The experimental results show that MAPIM is up to $\pmb{339}\times$ faster and $\pmb{ 221}\times$ more energy efficient than a GPGPU. As compared to the state-of-the-art PIM designs, our design is $\pmb{16}\times$ faster and $\pmb{ 1.8}\times$ more energy efficient with insignificant area overhead.

[1] Onur Mutlu,et al. Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[2] Wongyu Shin,et al. Rank-Level Parallelism in DRAM , 2017, IEEE Transactions on Computers.

[3] Onur Mutlu,et al. A case for exploiting subarray-level parallelism (SALP) in DRAM , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[4] Daniel E. O'Leary,et al. Artificial Intelligence and Big Data , 2013, IEEE Intelligent Systems.

[5] Hao Wu,et al. Two-terminal vertical memory cell for cross-point static random access memory applications , 2014 .

[6] Tajana Simunic,et al. MPIM: Multi-purpose in-memory processing using configurable resistive memory , 2017, 2017 22nd Asia and South Pacific Design Automation Conference (ASP-DAC).

[7] Yuan Xie,et al. DRISA: A DRAM-based Reconfigurable In-Situ Accelerator , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[8] Matthew Poremba,et al. NVMain: An Architectural-Level Main Memory Simulator for Emerging Non-volatile Memories , 2012, 2012 IEEE Computer Society Annual Symposium on VLSI.

[9] Mattan Erez,et al. A locality-aware memory hierarchy for energy-efficient GPU architectures , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[10] William J. Dally,et al. Fine-Grained DRAM: Energy-Efficient DRAM for Extreme Bandwidth Systems , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[11] Mika Laiho,et al. Stateful implication logic with memristors , 2009, 2009 IEEE/ACM International Symposium on Nanoscale Architectures.

[12] Cong Xu,et al. NVSim: A Circuit-Level Performance, Energy, and Area Model for Emerging Nonvolatile Memory , 2012, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[13] Tajana Simunic,et al. LUPIS: Latch-up based ultra efficient processing in-memory system , 2018, 2018 19th International Symposium on Quality Electronic Design (ISQED).

[14] Onur Mutlu,et al. Improving memory Bank-Level Parallelism in the presence of prefetching , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[15] Tao Zhang,et al. PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[16] Qi Wang,et al. A 20nm 1.8V 8Gb PRAM with 40MB/s program bandwidth , 2012, 2012 IEEE International Solid-State Circuits Conference.

[17] David R. Kaeli,et al. Multi2Sim: A simulation framework for CPU-GPU computing , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[18] Cheol Seong Hwang,et al. Prospective of Semiconductor Memory Devices: from Memory System to Materials , 2015 .

[19] Guangyu Sun,et al. PM3: Power Modeling and Power Management for Processing-in-Memory , 2018, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[20] Tao Zhang,et al. Fine-granularity tile-level parallelism in non-volatile memory architecture with two-dimensional bank subdivision , 2016, 2016 53nd ACM/EDAC/IEEE Design Automation Conference (DAC).

[21] Kiyoung Choi,et al. AIM , 2016 .

[22] Kiyoung Choi,et al. PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[23] Cong Xu,et al. Pinatubo: A processing-in-memory architecture for bulk bitwise operations in emerging non-volatile memories , 2016, 2016 53nd ACM/EDAC/IEEE Design Automation Conference (DAC).

[24] Onur Mutlu,et al. Fast Bulk Bitwise AND and OR in DRAM , 2015, IEEE Computer Architecture Letters.

[25] Tao Zhang,et al. Half-DRAM: A high-bandwidth and low-power DRAM architecture from the rethinking of fine-grained activation , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[26] Eby G. Friedman,et al. VTEAM – A General Model for Voltage Controlled Memristors , 2014 .