论文信息 - REQ-YOLO: A Resource-Aware, Efficient Quantization Framework for Object Detection on FPGAs

REQ-YOLO: A Resource-Aware, Efficient Quantization Framework for Object Detection on FPGAs

Deep neural networks (DNNs), as the basis of object detection, will play a key role in the development of future autonomous systems with full autonomy. The autonomous systems have special requirements of real-time, energy-e cient implementations of DNNs on a power-budgeted system. Two research thrusts are dedicated to per- formance and energy e ciency enhancement of the inference phase of DNNs. The first one is model compression techniques while the second is e cient hardware implementations. Recent researches on extremely-low-bit CNNs such as binary neural network (BNN) and XNOR-Net replace the traditional oating point operations with bi- nary bit operations, signi cantly reducing memory bandwidth and storage requirement, whereas suffering non-negligible accuracy loss and waste of digital signal processing (DSP) blocks on FPGAs. To overcome these limitations, this paper proposes REQ-YOLO, a resource aware, systematic weight quantization framework for object detection, considering both algorithm and hardware resource aspects in object detection. We adopt the block-circulant matrix method and propose a heterogeneous weight quantization using Alternative Direction Method of Multipliers (ADMM), an e ective optimization technique for general, non-convex optimization problems. To achieve real-time, highly efficient implementations on FPGA, we present the detailed hardware implementation of block circulant matrices on CONV layers and de- velop an e cient processing element (PE) structure supporting the heterogeneous weight quantization, CONV data ow and pipelining techniques, design optimization, and a template-based automatic synthesis framework to optimally exploit hardware resource. Experimental results show that our proposed REQ-YOLO framework can signi cantly compress the YOLO model while introducing very small accuracy degradation. The related codes are here: https://github.com/Anonymous788/heterogeneous_ADMM_YOLO.

[1] Philip Heng Wai Leong,et al. FINN: A Framework for Fast, Scalable Binarized Neural Network Inference , 2016, FPGA.

[2] Zhiyong Gao,et al. Hardware Implementation and Optimization of Tiny-YOLO Network , 2017, IFTC.

[3] Wayne Luk,et al. Comparing performance and energy efficiency of FPGAs and GPUs for high productivity computing , 2010, 2010 International Conference on Field-Programmable Technology.

[4] Yun Liang,et al. SpWA: An Efficient Sparse Winograd Convolutional Neural Networks Accelerator on FPGAs , 2018, 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC).

[5] Song Han,et al. Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding , 2015, ICLR.

[6] Ali Farhadi,et al. You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7] Chao Wang,et al. CirCNN: Accelerating and Compressing Deep Neural Networks Using Block-Circulant Weight Matrices , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[8] Keshab K. Parhi,et al. An efficient pipelined FFT architecture , 2003 .

[9] Yann LeCun,et al. Deep learning & convolutional networks , 2015, 2015 IEEE Hot Chips 27 Symposium (HCS).

[10] J. Tukey,et al. An algorithm for the machine calculation of complex Fourier series , 1965 .

[11] Jason Cong,et al. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks , 2015, FPGA.

[12] Ali Farhadi,et al. YOLO9000: Better, Faster, Stronger , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13] Yun Liang,et al. A comprehensive framework for synthesizing stencil algorithms on FPGAs using OpenCL model , 2017, 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC).

[14] Hiroki Nakahara,et al. A Lightweight YOLOv2: A Binarized CNN with A Parallel Support Vector Regression for an FPGA , 2018, FPGA.

[15] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.

[16] Rogério Schmidt Feris,et al. A Unified Multi-scale Deep Convolutional Neural Network for Fast Object Detection , 2016, ECCV.

[17] Luc Van Gool,et al. The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[18] Wei Zhang,et al. FlexCL: An analytical performance model for OpenCL workloads on flexible FPGAs , 2017, 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC).

[19] Yiran Chen,et al. Learning Structured Sparsity in Deep Neural Networks , 2016, NIPS.

[20] Yanzhi Wang,et al. Theoretical Properties for Neural Networks with Weight Matrices of Low Displacement Rank , 2017, ICML.

[21] Trevor Darrell,et al. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[22] Sachin S. Talathi,et al. Fixed Point Quantization of Deep Convolutional Networks , 2015, ICML.

[23] Qinru Qiu,et al. C-LSTM: Enabling Efficient LSTM using Structured Compression Techniques on FPGAs , 2018, FPGA.

[24] Peng Zhang,et al. Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs , 2017, 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC).

[25] Peng Zhang,et al. TGPA: Tile-Grained Pipeline Architecture for Low Latency CNN Inference , 2018, 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).

[26] Yu Wang,et al. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network , 2016, FPGA.

[27] Jian Cheng,et al. Quantized Convolutional Neural Networks for Mobile Devices , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28] Rong Jin,et al. Deep Learning at Alibaba , 2017, IJCAI.

[29] Ross B. Girshick,et al. Fast R-CNN , 2015, 1504.08083.

[30] Ali Farhadi,et al. YOLOv3: An Incremental Improvement , 2018, ArXiv.

[31] Smith,et al. Mathematics of the Discrete Fourier Transform (DFT) with Audio Applications , 2007 .

[32] Yoshua Bengio,et al. Neural Networks with Few Multiplications , 2015, ICLR.

[33] Shengen Yan,et al. Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs , 2017, 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM).

[34] Jian Sun,et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[35] Keshab K. Parhi,et al. Pipelined Architectures for Real-Valued FFT and Hermitian-Symmetric IFFT With Real Datapaths , 2013, IEEE Transactions on Circuits and Systems II: Express Briefs.

[36] Sergey Ioffe,et al. Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models , 2017, NIPS.

[37] Stephen P. Boyd,et al. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers , 2011, Found. Trends Mach. Learn..

[38] Song Han,et al. From model to FPGA: Software-hardware co-design for efficient neural network acceleration , 2016, 2016 IEEE Hot Chips 28 Symposium (HCS).

[39] Farid Melgani,et al. Convolutional SVM Networks for Object Detection in UAV Imagery , 2018, IEEE Transactions on Geoscience and Remote Sensing.

[40] Shengen Yan,et al. Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs , 2017, 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC).

[41] Manoj Alwani,et al. Fused-layer CNN accelerators , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[42] Igor Carron,et al. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks , 2016 .

[43] Sergey Ioffe,et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[44] Kaiming He,et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[45] Geoffrey E. Hinton,et al. ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[46] Yun Liang,et al. High-Level Synthesis: Productivity, Performance, and Software Constraints , 2012, J. Electr. Comput. Eng..

[47] V. Pan. Structured Matrices and Polynomials: Unified Superfast Algorithms , 2001 .

[48] Yoshua Bengio,et al. BinaryConnect: Training Deep Neural Networks with binary weights during propagations , 2015, NIPS.

[49] Song Han,et al. EIE: Efficient Inference Engine on Compressed Deep Neural Network , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[50] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[51] Song Han,et al. ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA , 2016, FPGA.