MAX2: An ReRAM-Based Neural Network Accelerator That Maximizes Data Reuse and Area Utilization

Although recent advances in resistive random access memory (ReRAM)-based accelerator designs for deep convolutional neural networks (CNNs) offer energy-efficiency improvements over CMOS-based accelerators, they still incur a large number of energy-consuming data transactions. In this paper, we propose MAX², a multi-tile ReRAM accelerator framework that supports multiple CNN topologies and maximizes on-chip data reuse while reducing on-chip bandwidth, thereby minimizing the energy consumed by data movement. Building on the fact that a large filter can be built from a stack of smaller ($3\times 3$) filters, we design every tile with nine processing elements (PEs). Each PE consists of multiple ReRAM subarrays that compute dot products. The PEs operate in a systolic fashion, maximizing input-feature-map reuse and minimizing interconnection cost. MAX² chooses the data-size granularity in the systolic array in conjunction with weight duplication to achieve very high area utilization without requiring additional peripheral circuits. We provide a detailed energy and area breakdown of each component at the PE, tile, and system levels. A system-level evaluation at the 32-nm node on several VGG-network benchmarks shows that MAX² improves computation efficiency (TOPs/s/mm²) by $2.5\times$ and energy efficiency (TOPs/s/W) by $5.2\times$ compared with a state-of-the-art ReRAM-based accelerator.
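The filter-decomposition observation the tile design builds on is a standard linear-convolution identity and is easy to check in software. The following minimal sketch (using NumPy and SciPy; the input size and kernels are illustrative, not from the paper) verifies that two stacked $3\times 3$ convolutions with no intervening nonlinearity are equivalent to one $5\times 5$ convolution, which is what lets a tile of nine $3\times 3$-oriented PEs serve larger filters:

```python
# Hedged sketch, not the MAX2 implementation: composing two 3x3
# convolutions equals a single 5x5 convolution whose kernel is the
# convolution of the two smaller kernels.
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
x  = rng.standard_normal((16, 16))   # toy input feature map
k1 = rng.standard_normal((3, 3))     # first 3x3 filter
k2 = rng.standard_normal((3, 3))     # second 3x3 filter

# Two stacked 3x3 convolutions (no nonlinearity between them) ...
stacked = convolve2d(convolve2d(x, k1, mode="valid"), k2, mode="valid")

# ... match one 5x5 convolution with the composed kernel (3+3-1 = 5).
k5x5   = convolve2d(k1, k2, mode="full")
direct = convolve2d(x, k5x5, mode="valid")

assert np.allclose(stacked, direct)  # identical 12x12 outputs
```

Note that the equivalence is exact only for the linear part of the computation; with a nonlinearity between layers, the stacked form matches only the receptive field, not the values.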
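The input-reuse benefit of the systolic dataflow can likewise be illustrated with a toy model. This is again a hedged sketch rather than the MAX² hardware (the PE count, stream length, and 1-D dot product are stand-ins): nine PEs each hold one stationary weight while the activation stream slides past them, so every activation is fetched once and consumed nine times:

```python
import numpy as np

W = np.arange(1.0, 10.0)        # nine stationary weights, one per PE
x = np.arange(20.0)             # activation stream entering the PE chain
N = len(x) - len(W) + 1         # number of valid dot-product positions

# PE j sees each activation j cycles after PE 0 and multiplies it by its
# stationary weight; partial sums accumulate as data marches through.
psum = np.zeros(N)
for j, w in enumerate(W):
    psum += w * x[j:j + N]      # contribution of PE j across all positions

# The chain computes a sliding-window dot product: each input element is
# read from memory once but reused by all nine PEs.
assert np.allclose(psum, np.correlate(x, W, mode="valid"))
```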
