PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory

Processing-in-memory (PIM) is a promising solution to address the “memory wall” challenges for future computer systems. Prior proposed PIM architectures put additional computation logic in or near memory. The emerging metal-oxide resistive random access memory (ReRAM) has showed its potential to be used for main memory. Moreover, with its crossbar array structure, ReRAM can perform matrixvector multiplication efficiently, and has been widely studied to accelerate neural network (NN) applications. In this work, we propose a novel PIM architecture, called PRIME, to accelerate NN applications in ReRAM based main memory. In PRIME, a portion of ReRAM crossbar arrays can be configured as accelerators for NN applications or as normal memory for a larger memory space. We provide microarchitecture and circuit designs to enable the morphable functions with an insignificant area overhead. We also design a software/hardware interface for software developers to implement various NNs on PRIME. Benefiting from both the PIM architecture and the efficiency of using ReRAM for NN computation, PRIME distinguishes itself from prior work on NN acceleration, with significant performance improvement and energy saving. Our experimental results show that, compared with a state-of-the-art neural processing unit design, PRIME improves the performance by ~2360x and the energy consumption by ~895x, across the evaluated machine learning benchmarks.

[1]  Maya Gokhale,et al.  Processing in Memory: The Terasys Massively Parallel PIM Array , 1995, Computer.

[2]  Anoop Gupta,et al.  Operating system support for improving data locality on CC-NUMA compute servers , 1996, ASPLOS VII.

[3]  Noah Treuhaft,et al.  Scalable Processors in the Billion-Transistor Era: IRAM , 1997, Computer.

[4]  Kunle Olukotun,et al.  A Single Chip Multiprocessor Integrated with DRAM , 1997 .

[5]  Duncan G. Elliott,et al.  Computational RAM: The case for SIMD computing in memory , 1997 .

[6]  Noah Treuhaft,et al.  Intelligent RAM (IRAM): the industrial setting, applications, and architectures , 1997, Proceedings International Conference on Computer Design VLSI in Computers and Processors.

[7]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[8]  Frederic T. Chong,et al.  Active pages: a computation model for intelligent memory , 1998, ISCA.

[9]  Chun Chen,et al.  The architecture of the DIVA processing-in-memory chip , 2002, ICS '02.

[10]  Sanjeev Kumar,et al.  Dynamic tracking of page miss ratio curve for memory management , 2004, ASPLOS XI.

[11]  Yasar Becerikli,et al.  Neural Network Implementation in Hardware Using FPGAs , 2006, ICONIP.

[12]  Yann LeCun,et al.  CNP: An FPGA-based processor for Convolutional Networks , 2009, 2009 International Conference on Field Programmable Logic and Applications.

[13]  Vijayalakshmi Srinivasan,et al.  Enhancing lifetime and security of PCM-based Main Memory with Start-Gap Wear Leveling , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[14]  Hoi-Jun Yoo,et al.  A 201.4 GOPS 496 mW Real-Time Multi-Object Recognition Processor With Bio-Inspired Neural Perception Engine , 2009, IEEE Journal of Solid-State Circuits.

[15]  Moinuddin K. Qureshi,et al.  Morphable memory system: a robust architecture for exploiting multi-level phase change memories , 2010, ISCA.

[16]  Shimeng Yu,et al.  Investigating the switching dynamics and multilevel capability of bipolar metal oxide resistive switching memory , 2011 .

[17]  Dharmendra S. Modha,et al.  A digital neurosynaptic core using embedded crossbar memory with 45pJ per spike in 45nm , 2011, 2011 IEEE Custom Integrated Circuits Conference (CICC).

[18]  Chung Lam,et al.  A Novel Reconfigurable Sensing Scheme for Variable Level Storage in Phase Change Memory , 2011, 2011 3rd IEEE International Memory Workshop (IMW).

[19]  J. Thomas Pawlowski,et al.  Hybrid memory cube (HMC) , 2011, 2011 IEEE Hot Chips 23 Symposium (HCS).

[20]  Luca Maria Gambardella,et al.  Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence Flexible, High Performance Convolutional Neural Networks for Image Classification , 2022 .

[21]  Fabien Alibart,et al.  Hybrid CMOS/nanodevice circuits for high throughput pattern matching applications , 2011, 2011 NASA/ESA Conference on Adaptive Hardware and Systems (AHS).

[22]  Yong Liu,et al.  A 45nm CMOS neuromorphic chip with a scalable architecture for learning in networks of spiking neurons , 2011, 2011 IEEE Custom Integrated Circuits Conference (CICC).

[23]  William J. Dally,et al.  GPUs and the Future of Parallel Computing , 2011, IEEE Micro.

[24]  Kinam Kim,et al.  A fast, high-endurance and scalable non-volatile memory device made from asymmetric Ta2O(5-x)/TaO(2-x) bilayer structures. , 2011, Nature materials.

[25]  Engin Ipek,et al.  A resistive TCAM accelerator for data-intensive computing , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[26]  Jung Ho Ahn,et al.  CACTI-3DD: Architecture-level modeling for 3D die-stacked DRAM main memory , 2012, 2012 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[27]  Qing Wu,et al.  Hardware realization of BSB recall function using memristor crossbar arrays , 2012, DAC Design Automation Conference 2012.

[28]  Jürgen Schmidhuber,et al.  Multi-column deep neural networks for image classification , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[29]  Shimeng Yu,et al.  Metal–Oxide RRAM , 2012, Proceedings of the IEEE.

[30]  Shoji Sakamoto,et al.  An 8Mb multi-layered cross-point ReRAM macro with 443MB/s write throughput , 2012, 2012 IEEE International Solid-State Circuits Conference.

[31]  Cong Xu,et al.  NVSim: A Circuit-Level Performance, Energy, and Area Model for Emerging Nonvolatile Memory , 2012, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[32]  J. Jeddeloh,et al.  Hybrid memory cube new DRAM architecture increases density and performance , 2012, 2012 Symposium on VLSI Technology (VLSIT).

[33]  W. Jang,et al.  A study on low-power, nanosecond operation and multilevel bipolar resistance switching in Ti/ZrO2/Pt nonvolatile memory with 1T1R architecture , 2012 .

[34]  Cong Xu,et al.  Design trade-offs for high density cross-point resistive memory , 2012, ISLPED '12.

[35]  Ligang Gao,et al.  High precision tuning of state for memristive devices by adaptable variation-tolerant algorithm , 2011, Nanotechnology.

[36]  Andrew B. Kahng,et al.  CACTI-IO: CACTI with off-chip power-area-timing models , 2012, 2012 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).

[37]  Luis Ceze,et al.  Neural Acceleration for General-Purpose Approximate Programs , 2014, IEEE Micro.

[38]  D. Strukov,et al.  A High Resolution Nonvolatile Analog Memory Ionic Devices , 2013 .

[39]  Yu Wang,et al.  Memristor-based approximated computation , 2013, International Symposium on Low Power Electronics and Design (ISLPED).

[40]  Shimeng Yu,et al.  3D vertical RRAM - Scaling limit analysis and demonstration of 3D array operation , 2013, 2013 Symposium on VLSI Technology.

[41]  Masahide Matsumoto,et al.  A 130.7mm2 2-layer 32Gb ReRAM memory device in 24nm technology , 2013, 2013 IEEE International Solid-State Circuits Conference Digest of Technical Papers.

[42]  Andrew S. Cassidy,et al.  Cognitive computing systems: Algorithms and applications for networks of neurosynaptic cores , 2013, The 2013 International Joint Conference on Neural Networks (IJCNN).

[43]  Norman P. Jouppi,et al.  Understanding the trade-offs in multi-level cell ReRAM memory design , 2013, 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC).

[44]  John Shalf,et al.  Design of a large-scale storage-class RRAM system , 2013, ICS '13.

[45]  Chung-Wei Hsu,et al.  Self-rectifying bipolar TaOx/TiO2 RRAM with superior endurance over 1012 cycles for 3D high-density storage-class memory , 2013, 2013 Symposium on VLSI Technology.

[46]  Rachata Ausavarungnirun,et al.  RowClone: Fast and energy-efficient in-DRAM bulk data copy and initialization , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[47]  Wei Zhang,et al.  Digital-assisted noise-eliminating training for memristor crossbar-based analog neuromorphic computing engine , 2013, 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC).

[48]  Fabien Alibart,et al.  Pattern classification by memristive crossbar circuits using ex situ and in situ training , 2013, Nature Communications.

[49]  Yukio Hayakawa,et al.  An 8 Mb Multi-Layered Cross-Point ReRAM Macro With 443 MB/s Write Throughput , 2012, IEEE Journal of Solid-State Circuits.

[50]  Chris Yakopcic,et al.  Exploring the design space of specialized multicore neural processors , 2013, The 2013 International Joint Conference on Neural Networks (IJCNN).

[51]  Eby G. Friedman,et al.  AC-DIMM: associative computing with STT-MRAM , 2013, ISCA.

[52]  Tao Wang,et al.  Deep learning with COTS HPC systems , 2013, ICML.

[53]  Yiran Chen,et al.  BSB training scheme implementation on memristor-based circuit , 2013, 2013 IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA).

[54]  Yiran Chen,et al.  Reduction and IR-drop compensations techniques for reliable neuromorphic computing systems , 2014, 2014 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).

[55]  Steven Swanson,et al.  Near-Data Processing: Insights from a MICRO-46 Workshop , 2014, IEEE Micro.

[56]  Ninghui Sun,et al.  DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning , 2014, ASPLOS.

[57]  Yu Wang,et al.  Training itself: Mixed-signal training acceleration for memristor-based neural network , 2014, 2014 19th Asia and South Pacific Design Automation Conference (ASP-DAC).

[58]  Luis Ceze,et al.  General-purpose code acceleration with limited-precision analog computation , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[59]  Jia Wang,et al.  DaDianNao: A Machine-Learning Supercomputer , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[60]  Feifei Li,et al.  NDC: Analyzing the impact of 3D-stacked memory+logic devices on MapReduce workloads , 2014, 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[61]  Zvika Guz Real-Time Analytics as the Killer Application for Processing-In-Memory , 2014 .

[62]  Yoshua Bengio,et al.  Low precision storage for deep learning , 2014 .

[63]  G. W. Burr,et al.  Experimental demonstration and tolerancing of a large-scale neural network (165,000 synapses), using phase-change memory as the synaptic weight element , 2015, 2014 IEEE International Electron Devices Meeting.

[64]  Jaejin Lee,et al.  25.2 A 1.2V 8Gb 8-channel 128GB/s high-bandwidth memory (HBM) stacked DRAM with effective microbump I/O test methods using 29nm process and TSV , 2014, 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC).

[65]  Cong Xu,et al.  Architecting 3D vertical resistive memory for next-generation storage systems , 2014, 2014 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).

[66]  Jacques-Olivier Klein,et al.  Spin-transfer torque magnetic memory as a stochastic memristive synapse , 2014, 2014 IEEE International Symposium on Circuits and Systems (ISCAS).

[67]  Mike Ignatowski,et al.  TOP-PIM: throughput-oriented programmable processing in memory , 2014, HPDC '14.

[68]  Dmitri B. Strukov,et al.  SpongeDirectory: Flexible sparse directories utilizing multi-level memristors , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).

[69]  Tao Zhang,et al.  Overcoming the challenges of crossbar resistive memory architectures , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[70]  Jacob Nelson,et al.  SNNAP: Approximate computing on programmable SoCs via neural acceleration , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[71]  Yong Zhang,et al.  A Reconfigurable Digital Neuromorphic Processor with Memristive Synaptic Crossbar for Cognitive Computing , 2015, ACM J. Emerg. Technol. Comput. Syst..

[72]  Kiyoung Choi,et al.  A scalable processing-in-memory accelerator for parallel graph processing , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[73]  Babak Falsafi,et al.  Sort vs. Hash Join Revisited for Near-Memory Execution , 2015 .

[74]  Stephen W. Keckler,et al.  Page Placement Strategies for GPUs within Heterogeneous Memory Systems , 2015, ASPLOS.

[75]  Xuehai Zhou,et al.  PuDianNao: A Polyvalent Machine Learning Accelerator , 2015, ASPLOS.

[76]  Franz Franchetti,et al.  Data reorganization in memory using 3D-stacked DRAM , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[77]  Tejas Karkhanis,et al.  Active Memory Cube: A processing-in-memory architecture for exascale systems , 2015, IBM J. Res. Dev..

[78]  Y. Leblebici,et al.  Large-scale neural networks implemented with non-volatile memory as the synaptic weight element: Comparative performance analysis (accuracy, speed, and power) , 2015, 2015 IEEE International Electron Devices Meeting (IEDM).

[79]  Farnood Merrikh-Bayat,et al.  Training and operation of an integrated neuromorphic network based on metal-oxide memristors , 2014, Nature.

[80]  Peng Huang,et al.  Optimized learning scheme for grayscale image recognition in a RRAM based analog neuromorphic system , 2015, 2015 IEEE International Electron Devices Meeting (IEDM).

[81]  Andrew B. Kahng,et al.  CACTI-IO: CACTI With OFF-Chip Power-Area-Timing Models , 2015, IEEE Trans. Very Large Scale Integr. Syst..

[82]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[83]  Yu Wang,et al.  RRAM-Based Analog Approximate Computing , 2015, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[84]  Yu Wang,et al.  SEAL-lab Technical Report – No . 2015-001 ( April 29 , 2016 ) Processing-in-Memory in ReRAM-based Main Memory , 2016 .

[85]  Shaoli Liu,et al.  Cambricon: An Instruction Set Architecture for Neural Networks , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[86]  Cong Xu,et al.  Pinatubo: A processing-in-memory architecture for bulk bitwise operations in emerging non-volatile memories , 2016, 2016 53nd ACM/EDAC/IEEE Design Automation Conference (DAC).

[87]  Miao Hu,et al.  ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).