High-Throughput, Area-Efficient, and Variation-Tolerant 3-D In-Memory Compute System for Deep Convolutional Neural Networks

Untethered computing with deep convolutional neural networks (DCNNs) at the resource-limited edge of the IoT requires systems that are exceedingly power- and area-efficient. Analog in-memory matrix-matrix multiplications enabled by emerging memories can significantly reduce the energy budget of such systems and yield compact accelerators. In this article, we report a high-throughput RRAM-based DCNN processor that achieves 7.12× area-efficiency (AE) and 6.52× power-efficiency (PE) enhancements over state-of-the-art accelerators. We achieve this by coupling a novel in-memory computing methodology with a staggered 3-D memristor array. Our variation-tolerant in-memory compute method, which performs operations on signed floating-point numbers within a single array, leverages charge-domain operations and conductance discretization to reduce peripheral overheads. Voltage pulses applied at the staggered bottom electrodes of the 3-D array generate a concurrent input shift and parallelize convolution operations to boost throughput. The high density and low footprint of the 3-D array, together with the modified in-memory matrix-matrix multiplication (M2M) execution, improve peak AE to 9.1 TOPs mm⁻², while the elimination of input regeneration improves PE to 10.6 TOPs W⁻¹. This work provides a path toward reliable RRAM-based hardware accelerators that are fast, low power, and low area.
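To make the crossbar mapping concrete, the minimal Python sketch below lowers a 2-D convolution onto an idealized simulated crossbar: signed weights are mapped onto a differential pair of discretized conductances, and each shifted input window is applied as a voltage vector whose differential column current gives the dot product. This is an illustrative model under simplifying assumptions, not the staggered 3-D charge-domain design reported here; the function names (`quantize_conductance`, `crossbar_conv2d`) and the device parameters (conductance range, 16 levels) are hypothetical.

```python
# Illustrative sketch only: an idealized crossbar model, not the authors'
# staggered 3-D charge-domain implementation. All names and device
# parameters below are assumptions chosen for readability.
import numpy as np

def quantize_conductance(w, g_min=1e-6, g_max=1e-4, levels=16):
    """Map a signed weight vector onto a differential pair of conductance
    vectors (G+, G-) snapped to a small number of discrete levels.
    The g_min offset appears in both vectors, so it cancels in the
    differential read-out."""
    w_max = float(np.max(np.abs(w))) or 1.0
    scale = (g_max - g_min) / w_max                   # weight -> conductance
    g_pos = np.where(w > 0,  w, 0.0) * scale + g_min  # positive weights
    g_neg = np.where(w < 0, -w, 0.0) * scale + g_min  # negative weights
    step = (g_max - g_min) / (levels - 1)             # conductance quantum
    snap = lambda g: g_min + np.round((g - g_min) / step) * step
    return snap(g_pos), snap(g_neg), scale

def crossbar_conv2d(image, kernel):
    """Convolution (DNN correlation convention) expressed as a crossbar
    matrix-matrix product: every shifted input window becomes a voltage
    vector, and the differential column current is that window's dot
    product with the quantized kernel."""
    kh, kw = kernel.shape
    g_pos, g_neg, scale = quantize_conductance(kernel.ravel())
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            v = image[i:i + kh, j:j + kw].ravel()     # shifted input window
            i_diff = v @ g_pos - v @ g_neg            # differential column current
            out[i, j] = i_diff / scale                # back to weight units
    return out

# Quick check against an exact sliding-window correlation: the residual is
# bounded by the conductance quantization (levels), not by the mapping.
img = np.random.rand(8, 8)
ker = np.random.randn(3, 3)
exact = np.array([[np.sum(img[i:i + 3, j:j + 3] * ker) for j in range(6)]
                  for i in range(6)])
print(np.max(np.abs(crossbar_conv2d(img, ker) - exact)))
```

In this toy model the differential pair handles signed weights and the g_min offsets cancel exactly; what remains is the quantization error from the finite number of conductance levels, which is the overhead the charge-domain discretization scheme described above is designed to keep small.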
