Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks

With the recent advancement of multilayer convolutional neural networks (CNN), deep learning has achieved amazing success in many areas, especially in visual content understanding and classification. To improve the performance and energy-efficiency of the computation-demanding CNN, the FPGA-based acceleration emerges as one of the most attractive alternatives. In this paper we design and implement Caffeine, a hardware/software co-designed library to efficiently accelerate the entire CNN on FPGAs. First, we propose a uniformed convolutional matrix-multiplication representation for both computation-intensive convolutional layers and communication-intensive fully connected (FCN) layers. Second, we design Caffeine with the goal to maximize the underlying FPGA computing and bandwidth resource utilization, with a key focus on the bandwidth optimization by the memory access reorganization not studied in prior work. Moreover, we implement Caffeine in the portable high-level synthesis and provide various hardware/software definable parameters for user configurations. Finally, we also integrate Caffeine into the industry-standard software deep learning framework Caffe. We evaluate Caffeine and its integration with Caffe by implementing VGG16 and AlexNet network on multiple FPGA platforms. Caffeine achieves a peak performance of 365 GOPS on Xilinx KU060 FPGA and 636 GOPS on Virtex7 690t FPGA. This is the best published result to our best knowledge. We achieve more than 100× speedup on FCN layers over previous FPGA accelerators. An end-to-end evaluation with Caffe integration shows up to 7.3× and 43.5× performance and energy gains over Caffe on a 12-core Xeon server, and 1.5× better energy-efficiency over the GPU implementation on a medium-sized FPGA (KU060). Performance projections to a system with a high-end FPGA (Virtex7 690t) shows even higher gains.

[1]  Yann LeCun,et al.  CNP: An FPGA-based processor for Convolutional Networks , 2009, 2009 International Conference on Field Programmable Logic and Applications.

[2]  Samuel Williams,et al.  Roofline: an insightful visual performance model for multicore architectures , 2009, CACM.

[3]  Srihari Cadambi,et al.  A Massively Parallel Coprocessor for Convolutional Neural Networks , 2009, 2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors.

[4]  Srihari Cadambi,et al.  A programmable parallel accelerator for learning and classification , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[5]  Srihari Cadambi,et al.  A dynamically configurable coprocessor for convolutional neural networks , 2010, ISCA.

[6]  Dong Yu,et al.  Improved Bottleneck Features Using Pretrained Deep Neural Networks , 2011, INTERSPEECH.

[7]  Tara N. Sainath,et al.  FUNDAMENTAL TECHNOLOGIES IN MODERN SPEECH RECOGNITION Digital Object Identifier 10.1109/MSP.2012.2205597 , 2012 .

[8]  W. Marsden I and J , 2012 .

[9]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[10]  Henk Corporaal,et al.  Memory-centric accelerator design for Convolutional Neural Networks , 2013, 2013 IEEE 31st International Conference on Computer Design (ICCD).

[11]  Zhi-Jie Yan,et al.  A scalable approach to using DNN-derived features in GMM-HMM based acoustic modeling for LVCSR , 2013, INTERSPEECH.

[12]  Jason Cong,et al.  Improving high level synthesis optimization opportunity through polyhedral transformations , 2013, FPGA '13.

[13]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[14]  Tao Wang,et al.  Deep learning with COTS HPC systems , 2013, ICML.

[15]  Kai Yu,et al.  Large-scale deep learning at Baidu , 2013, CIKM.

[16]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[17]  Ninghui Sun,et al.  DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning , 2014, ASPLOS.

[18]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[19]  Marc'Aurelio Ranzato,et al.  Multi-GPU Training of ConvNets , 2013, ICLR.

[20]  Ming Yang,et al.  DeepFace: Closing the Gap to Human-Level Performance in Face Verification , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[21]  Sparsh Mittal,et al.  A Survey of Techniques for Managing and Leveraging Caches in GPUs , 2014, J. Circuits Syst. Comput..

[22]  Karin Strauss,et al.  Accelerating Deep Convolutional Neural Networks Using Specialized Hardware , 2015 .

[23]  Jason Cong,et al.  Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks , 2015, FPGA.

[24]  Jian Sun,et al.  Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[25]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[27]  Jason Cong,et al.  A quantitative analysis on microarchitectures of modern CPU-FPGA platforms , 2016, 2016 53nd ACM/EDAC/IEEE Design Automation Conference (DAC).

[28]  Martín Abadi,et al.  TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems , 2016, ArXiv.

[29]  Yu Cao,et al.  Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks , 2016, FPGA.

[30]  Yu Wang,et al.  Going Deeper with Embedded FPGA Platform for Convolutional Neural Network , 2016, FPGA.

[31]  Jing Li,et al.  Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural Network , 2017, FPGA.

[32]  Viktor Prasanna,et al.  Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System , 2017, FPGA.

[33]  Peng Zhang,et al.  Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs , 2017, 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC).