A High Throughput Acceleration for Hybrid Neural Networks With Efficient Resource Management on FPGA

Deep learning has driven the development of artificial intelligence and achieved remarkable success across many intelligent tasks. Convolution-based layers (CLs), fully connected layers (FLs), and recurrent layers (RLs) are the three main types of layers in classic neural networks. Many intelligent tasks are implemented with hybrid neural networks (hybrid-NNs), which are commonly composed of layer blocks (LBs) of CLs, FLs, and RLs. Because CLs account for most of the computation in hybrid-NNs, many field-programmable gate array (FPGA)-based accelerators focus on CL acceleration and have demonstrated strong performance. However, such CL-centric accelerators underutilize FPGA resources when accelerating a whole hybrid-NN. To fully exploit both the logic resources and the memory bandwidth across CLs, FLs, and RLs, we propose a resource-efficient FPGA mapping mechanism for hybrid-NNs. The mechanism first improves DSP utilization by packing multiple small-bit-width operations onto a single DSP. It then applies LB-level spatial mapping to exploit the complementary resource demands of the different layer blocks in the hybrid-NN. We evaluate the mapping mechanism by implementing four hybrid-NNs on a Xilinx Virtex-7 690T FPGA. The proposed mechanism achieves a peak performance of 1805.8 giga operations per second (GOP/s). Analysis of resource utilization and throughput shows that the proposed method exploits more of the FPGA's computing power and achieves up to $4.13 \times$ higher throughput than state-of-the-art accelerators.
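
To make the DSP-packing idea concrete, the following is a minimal Python sketch of how two small-bit-width multiplications that share one operand can be carried out with a single wide multiplication, in the spirit of the common DSP packing trick. The function name, the unsigned 8-bit operands, and the 18-bit separation are illustrative assumptions, not the paper's actual RTL.

def packed_multiply(a, b, w, shift=18):
    # Emulate two 8-bit x 8-bit products sharing one multiplier.
    # Assumptions (illustrative, not from the paper): unsigned 8-bit
    # operands and an 18-bit separation, so the low product (at most
    # 16 bits) cannot spill into the high lanes of the wide result.
    packed = (a << shift) | b           # pack both activations into one wide operand
    product = packed * w                # one wide multiplication stands in for one DSP
    low = product & ((1 << shift) - 1)  # recovers b * w
    high = product >> shift             # recovers a * w
    return high, low

# Quick check against the two separate products.
a, b, w = 117, 42, 201
assert packed_multiply(a, b, w) == (a * w, b * w)

On an actual DSP slice the packed value feeds the wider multiplier port, and signed operands require an additional correction step; both details are omitted here for clarity.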
