An Energy-Efficient Deep Convolutional Neural Network Inference Processor With Enhanced Output Stationary Dataflow in 65-nm CMOS

We propose a deep convolutional neural network (CNN) inference processor based on a novel enhanced output stationary (EOS) dataflow. Based on the observation that some activations are shared between two successive convolutions, the EOS dataflow stores this reused activation data in dedicated register files (RFs), eliminating redundant accesses to energy-hungry SRAM banks. In addition, the processing elements (PEs) are split into multiple small groups, each covering one tile of the input activation map, to increase the utilization of the activation RFs (ARFs). The processor has two voltage/frequency domains: the computation domain with 512 PEs operates at a near-threshold voltage (NTV) of 0.4 V and 60 MHz to maximize energy efficiency, while the rest of the processor, including 848 KB of SRAM, runs at 0.7 V and 120 MHz to increase both on-chip and off-chip memory bandwidth. Measurement results show that the processor achieves an energy efficiency of 831 GOPS/W on AlexNet, 1151 GOPS/W on VGG-16, 1004 GOPS/W on ResNet-18, and 948 GOPS/W on MobileNet.
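To make the reuse pattern behind the EOS dataflow concrete, the sketch below (our illustration, not the authors' RTL; the kernel size K, stride S, and row width W are assumed example values) counts SRAM reads for one row of convolution outputs with and without a register file holding the window columns shared between successive windows. With K = 3 and S = 1, two horizontally successive windows overlap in K - S = 2 of their K columns, so only the newly exposed columns need to come from SRAM.

```python
# Minimal sketch of the activation-reuse idea behind the EOS dataflow.
# K, S, W are hypothetical example values, not taken from the paper.
K, S = 3, 1   # kernel size and stride
W = 8         # width of one input row

naive_reads = 0  # every window element fetched from SRAM
eos_reads = 0    # overlapping columns served from the register file (ARF)

num_windows = (W - K) // S + 1
for w in range(num_windows):
    naive_reads += K * K          # full K x K window read from SRAM
    if w == 0:
        eos_reads += K * K        # first window: nothing cached yet
    else:
        eos_reads += K * S        # only the S new column(s), K rows each

print(f"SRAM reads per output row: naive={naive_reads}, EOS-style={eos_reads}")
# With K=3, S=1, W=8: naive=54, EOS-style=24, i.e. roughly 56% fewer SRAM reads.
```

Under these assumed parameters the register file absorbs more than half of the SRAM traffic for a single output row; since an SRAM access costs considerably more energy than a register read, this is the mechanism by which the dedicated ARFs improve energy efficiency.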
