Light-OPU: An FPGA-based Overlay Processor for Lightweight Convolutional Neural Networks

Lightweight convolutional neural networks (LW-CNNs) such as MobileNet, ShuffleNet, and SqueezeNet have emerged in the past few years for fast inference on embedded and mobile systems. However, lightweight operations limit the acceleration potential of GPUs: they are memory-bounded, and their parallelism does not map well to SIMD execution. This calls for more specialized accelerators. In this paper, we propose Light-OPU, an FPGA-based overlay processor with a corresponding compilation flow for general LW-CNN acceleration. The software-hardware co-designed Light-OPU reformulates and decomposes lightweight operations for efficient acceleration. Moreover, our instruction architecture allows lightweight operations and conventional convolutions to share the main computation engine, which improves run-time resource efficiency and overall power efficiency. Finally, Light-OPU is software-programmable: loading compiled code and kernel weights switches the target network without FPGA reconfiguration. Our experiments on seven major LW-CNNs show that Light-OPU achieves 5.5x lower latency and 3.0x higher power efficiency on average than the edge GPU NVIDIA Jetson TX2, and 1.3x to 8.4x better power efficiency than previous customized FPGA accelerators. To the best of our knowledge, Light-OPU is the first in-depth study of an FPGA-based general processor for LW-CNN acceleration with high performance and power efficiency, evaluated on all major LW-CNNs including the newly released MobileNetV3.
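To illustrate why lightweight operations are memory-bounded, the sketch below compares the arithmetic intensity (multiply-accumulates per byte of data moved) of a standard 3x3 convolution against a depthwise 3x3 convolution of the same spatial size. The layer shape and byte counts are illustrative assumptions, not figures from the paper; the point is only the orders-of-magnitude gap in compute per byte that makes depthwise layers bandwidth-limited on SIMD hardware.

```python
# Arithmetic intensity (MACs per byte moved) for a standard vs. a depthwise
# 3x3 convolution. Layer shape (56x56x128) is an assumed MobileNet-style
# example, not taken from the Light-OPU paper.

def conv_intensity(h, w, cin, cout, k=3, bytes_per_elem=1, depthwise=False):
    """Return (MACs, bytes moved, MACs/byte) for one same-size-output conv layer."""
    if depthwise:
        # One k x k filter per input channel; output has cin channels.
        macs = h * w * cin * k * k
        weights = cin * k * k
        out_ch = cin
    else:
        # Dense connection: every output channel sees every input channel.
        macs = h * w * cin * cout * k * k
        weights = cin * cout * k * k
        out_ch = cout
    # Bytes moved: read input feature map + weights, write output feature map.
    data = (h * w * cin + h * w * out_ch + weights) * bytes_per_elem
    return macs, data, macs / data

std = conv_intensity(56, 56, 128, 128)
dw = conv_intensity(56, 56, 128, 128, depthwise=True)
print(f"standard  conv: {std[2]:.1f} MACs/byte")
print(f"depthwise conv: {dw[2]:.1f} MACs/byte")
```

With these assumed shapes the standard convolution performs hundreds of MACs per byte moved while the depthwise convolution performs only a handful, so the latter saturates memory bandwidth long before it saturates the compute array.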
