MB-CNN: Memristive Binary Convolutional Neural Networks for Embedded Mobile Devices

Applications of neural networks have gained significant importance in embedded mobile devices and Internet of Things (IoT) nodes. In particular, convolutional neural networks have emerged as one of the most powerful techniques in computer vision, speech recognition, and other AI applications that can improve the mobile user experience. However, satisfying the power and performance requirements of such low-power devices is a significant challenge. Recent work has shown that binarizing a neural network can significantly reduce its memory requirements on mobile devices at the cost of a minor loss in accuracy. This paper proposes MB-CNN, a memristive accelerator for binary convolutional neural networks that performs XNOR convolution in situ in novel 2R memristive data blocks to improve the power, performance, and memory requirements of embedded mobile devices. The proposed accelerator achieves at least 13.26×, 5.91×, and 3.18× improvements in system energy efficiency (computed as energy × delay) over state-of-the-art software, GPU, and PIM architectures, respectively. The solution architecture, which integrates a CPU, a GPU, and MB-CNN, outperforms every other configuration in terms of system energy and execution time.
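
As a software illustration of the XNOR convolution mentioned above, the sketch below computes a binary dot product with XNOR and a population count. It is a minimal sketch, not the MB-CNN hardware design: the helper names (`binarize`, `xnor_dot`, `reference_dot`) and the bit encoding (1 for +1, 0 for -1) are assumptions made for this example. For two vectors of length n with entries in {-1, +1}, the dot product equals 2 × popcount(XNOR(a, b)) - n.

```python
def binarize(values):
    """Pack the signs of `values` into an int: bit i is 1 if values[i] >= 0 (+1), else 0 (-1)."""
    bits = 0
    for i, v in enumerate(values):
        if v >= 0:
            bits |= 1 << i
    return bits


def xnor_dot(a_bits, b_bits, n):
    """Dot product of two {-1, +1} vectors of length n, given their packed bit encodings."""
    mask = (1 << n) - 1                 # keep only the n valid bit positions
    agree = ~(a_bits ^ b_bits) & mask   # XNOR: 1 wherever the two signs agree
    matches = bin(agree).count("1")     # population count
    return 2 * matches - n              # (+1 per match) + (-1 per mismatch)


def reference_dot(a, b):
    """Plain sign-based dot product, used only to check xnor_dot."""
    def sign(v):
        return 1 if v >= 0 else -1
    return sum(sign(x) * sign(y) for x, y in zip(a, b))


# Hypothetical input window and filter weights (illustrative values only).
activations = [0.5, -1.2, 3.0, -0.1]
weights = [1.0, 2.0, -0.5, -4.0]

result = xnor_dot(binarize(activations), binarize(weights), len(weights))
assert result == reference_dot(activations, weights)
```

A full binary convolution would apply `xnor_dot` to every window of the input feature map. The point of the encoding is that multiply-accumulate reduces to bitwise logic and counting.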
