LACS: A High-Computational-Efficiency Accelerator for CNNs

Convolutional neural networks (CNNs) have grown steadily deeper. As depth increases, the invalid calculations introduced by zero-padding operations, zero-filling operations, and stride lengths greater than one account for a growing proportion of all calculations. To adapt to different CNNs and to eliminate the influence of zero-padding, zero-filling, and stride length on the accelerator's computational efficiency, we draw on the computation pattern of CPUs to design an efficient and versatile CNN accelerator, LACS (Loading-Addressing-Computing-Storing). A bypass-buffer mechanism reduces the data movement between the registers and the on-chip buffer from $O(k \times k)$ to $O(k)$, where $k$ is the kernel size. Finally, we deploy LACS on a field-programmable gate array (FPGA) chip and analyze the factors that affect its computational efficiency. Running popular CNNs on LACS, we measure an extremely high computational efficiency of 98.51% for AlexNet and 99.66% for VGG-16, significantly exceeding state-of-the-art accelerators.
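To make the bypass-buffer idea concrete, here is a minimal sketch, not the authors' RTL: it models a $k \times k$ window sliding along an input row at stride 1 and counts on-chip-buffer loads. Holding the window in registers and shifting it one column per step means only $k$ new elements are fetched per output pixel instead of $k \times k$. The kernel size `K`, row width `W`, and all identifiers are illustrative assumptions.

```c
/* Sketch of the bypass-buffer data-movement reduction (assumed model,
 * not the LACS implementation). A naive design reloads the full k x k
 * window from the on-chip buffer for every output pixel; shifting the
 * register window left by one column loads only k new elements. */
#include <stdio.h>

#define K 3 /* kernel size (assumed) */
#define W 8 /* input row width (assumed) */

int main(void) {
    int input[K][W];   /* stand-in for the on-chip buffer */
    int window[K][K];  /* stand-in for the register window */
    long naive_loads = 0, bypass_loads = 0;

    for (int r = 0; r < K; r++)
        for (int c = 0; c < W; c++)
            input[r][c] = r * W + c; /* dummy data */

    for (int out = 0; out + K <= W; out++) {
        naive_loads += K * K; /* naive: refill the whole window */
        if (out == 0) {
            /* first window position: full k x k load */
            for (int r = 0; r < K; r++)
                for (int c = 0; c < K; c++)
                    window[r][c] = input[r][out + c];
            bypass_loads += K * K;
        } else {
            /* shift columns left, then load one new column (k elements) */
            for (int r = 0; r < K; r++) {
                for (int c = 0; c < K - 1; c++)
                    window[r][c] = window[r][c + 1];
                window[r][K - 1] = input[r][out + K - 1];
            }
            bypass_loads += K;
        }
    }
    printf("naive loads:  %ld\n", naive_loads);  /* 6 outputs * 9 = 54 */
    printf("bypass loads: %ld\n", bypass_loads); /* 9 + 5 * 3 = 24 */
    return 0;
}
```

Per output pixel the bypass path touches the buffer $k$ times instead of $k^2$, which is the $O(k \times k)$ to $O(k)$ reduction claimed above; only the first window position pays the full cost.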
