XNOR-POP: A processing-in-memory architecture for binary Convolutional Neural Networks in Wide-IO2 DRAMs

It is challenging to adopt computing-intensive and parameter-rich Convolutional Neural Networks (CNNs) in mobile devices due to limited hardware resources and low power budgets. To support multiple concurrently running applications, one mobile device needs to perform multiple CNN tests simultaneously in real-time. Previous solutions cannot guarantee a high enough frame rate when serving multiple applications with reasonable hardware and power cost. In this paper, we present a novel process-in-memory architecture to process emerging binary CNN tests in Wide-IO2 DRAMs. Compared to state-of-the-art accelerators, our design improves CNN test performance by 4× ∼ 11× with small hardware and power overhead.

[1]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[2]  Yoshua Bengio,et al.  BinaryConnect: Training Deep Neural Networks with binary weights during propagations , 2015, NIPS.

[3]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[4]  Cong Xu,et al.  Pinatubo: A processing-in-memory architecture for bulk bitwise operations in emerging non-volatile memories , 2016, 2016 53nd ACM/EDAC/IEEE Design Automation Conference (DAC).

[5]  Pritish Narayanan,et al.  Deep Learning with Limited Numerical Precision , 2015, ICML.

[6]  Patrice Y. Simard,et al.  Best practices for convolutional neural networks applied to visual document analysis , 2003, Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings..

[7]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[8]  Hao Yu,et al.  A 7.663-TOPS 8.2-W Energy-efficient FPGA Accelerator for Binary Convolutional Neural Networks (Abstract Only) , 2017, FPGA.

[9]  Tao Zhang,et al.  PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[10]  Byung Soo Kim,et al.  18.4 An 1.1V 68.2GB/s 8Gb Wide-IO2 DRAM with non-contact microbump I/O test scheme , 2016, 2016 IEEE International Solid-State Circuits Conference (ISSCC).

[11]  Luca Benini,et al.  YodaNN: An Ultra-Low Power Convolutional Neural Network Accelerator Based on Binary Weights , 2016, 2016 IEEE Computer Society Annual Symposium on VLSI (ISVLSI).

[12]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Jürgen Schmidhuber,et al.  Multi-column deep neural networks for image classification , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Onur Mutlu,et al.  Fast Bulk Bitwise AND and OR in DRAM , 2015, IEEE Computer Architecture Letters.

[15]  Paris Smaragdis,et al.  Bitwise Neural Networks , 2016, ArXiv.

[16]  Tianshi Chen,et al.  ShiDianNao: Shifting vision processing closer to the sensor , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[17]  Sanu Mathew,et al.  A 2.1GHz 6.5mW 64-bit Unified PopCount/BitScan Datapath Unit for 65nm High-Performance Microprocessor Execution Cores , 2008, 21st International Conference on VLSI Design (VLSID 2008).

[18]  J. Ticehurst Cacti , 1983 .

[19]  Igor Carron,et al.  XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks , 2016 .

[20]  Yann LeCun,et al.  CNP: An FPGA-based processor for Convolutional Networks , 2009, 2009 International Conference on Field Programmable Logic and Applications.

[21]  Alec Wolman,et al.  MCDNN: An Approximation-Based Execution Framework for Deep Stream Processing Under Resource Constraints , 2016, MobiSys.

[22]  Miao Hu,et al.  ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[23]  Sudhakar Yalamanchili,et al.  Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[24]  Bruce Jacob,et al.  Memory Systems: Cache, DRAM, Disk , 2007 .

[25]  Jason Cong,et al.  Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks , 2015, FPGA.

[26]  S.E. Schuster,et al.  Fast Low Power eDRAM Hierarchical Differential Sense Amplifier , 2009, IEEE Journal of Solid-State Circuits.

[27]  Yu Cao,et al.  Exploring sub-20nm FinFET design with Predictive Technology Models , 2012, DAC Design Automation Conference 2012.

[28]  Wendy Glauser,et al.  Doctors among early adopters of Google Glass , 2013, Canadian Medical Association Journal.

[29]  Ali Farhadi,et al.  You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Vivienne Sze,et al.  14.5 Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks , 2016, ISSCC.