Multiscale Co-Design Analysis of Energy, Latency, Area, and Accuracy of a ReRAM Analog Neural Training Accelerator

Neural networks are an increasingly attractive approach to natural language processing and pattern recognition. Deep networks with >50 M parameters are made possible by modern graphics processing unit (GPU) clusters operating at <50 pJ per operation, and more recently, production accelerators have achieved <5 pJ per operation at the board level. However, with the slowing of CMOS scaling, new paradigms will be required to achieve the next several orders of magnitude in performance-per-watt gains. Using an analog resistive memory (ReRAM) crossbar to perform key matrix operations in an accelerator is an attractive option. This paper presents a detailed design, using a state-of-the-art 14/16 nm process development kit, of an analog crossbar circuit block that processes three key kernels required in training and inference of neural networks. A detailed circuit- and device-level analysis of energy, latency, area, and accuracy is given and compared with relevant designs using standard digital ReRAM and static random access memory (SRAM) operations. It is shown that the analog accelerator has a 270× energy and 540× latency advantage over a similar block utilizing only digital ReRAM, consuming only 11 fJ per multiply-and-accumulate (MAC) operation. Compared with an SRAM-based accelerator, the energy is 430× better and the latency is 34× better. Although training accuracy is degraded in the analog accelerator, several options to improve it are presented. The possible gains over a similar digital-only version of this accelerator block suggest that continued optimization of analog resistive memories is valuable. This detailed circuit and device analysis of a training accelerator may serve as a foundation for further architecture-level studies.
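The abstract does not spell out the three kernels, but crossbar training accelerators conventionally use a forward matrix-vector multiply for inference, a transpose multiply for error backpropagation, and a parallel outer-product weight update for training. The sketch below is a minimal idealized NumPy model of those three kernels, not the authors' circuit model: the 256×256 array size, the differential conductance pairs for signed weights, and the multiplicative read-noise term are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

ROWS, COLS = 256, 256    # assumed crossbar dimensions (not from the paper)
READ_NOISE = 0.01        # assumed relative read noise per analog access

# Weights stored as analog conductances; a differential pair (G_plus, G_minus)
# encodes signed values, a common convention in crossbar accelerators.
W = 0.1 * rng.standard_normal((ROWS, COLS))
G_plus = np.clip(W, 0, None)
G_minus = np.clip(-W, 0, None)

def read(G):
    """Model one analog read: stored conductance plus multiplicative noise."""
    return G * (1 + READ_NOISE * rng.standard_normal(G.shape))

def forward_mvm(x):
    """Kernel 1: forward pass. Drive rows with x and sense column currents;
    Ohm's law plus Kirchhoff's current law yield all dot products at once."""
    G_eff = read(G_plus) - read(G_minus)
    return x @ G_eff

def backward_mvm(delta):
    """Kernel 2: backpropagation. Drive columns with the error vector and
    sense row currents, i.e., multiply by the transpose of the stored matrix."""
    G_eff = read(G_plus) - read(G_minus)
    return G_eff @ delta

def outer_product_update(x, delta, lr=0.01):
    """Kernel 3: training. Pulsing rows with x and columns with delta applies
    a rank-1 outer-product update to every cell in parallel."""
    global G_plus, G_minus
    dW = lr * np.outer(x, delta)
    G_plus = np.clip(G_plus + np.clip(dW, 0, None), 0, None)
    G_minus = np.clip(G_minus + np.clip(-dW, 0, None), 0, None)

# Usage: one training step for a single linear layer.
x = rng.standard_normal(ROWS)
y = forward_mvm(x)
delta = rng.standard_normal(COLS)   # stand-in for a backpropagated error
grad_x = backward_mvm(delta)
outer_product_update(x, delta)
```

Because all three kernels execute in the array in O(1) time rather than element by element, the per-operation cost can be very low; if the reported 270× energy advantage is read per MAC, the 11 fJ analog figure would imply roughly 3 pJ per MAC for the digital-ReRAM baseline, consistent with the board-level digital figures quoted above.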
