Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs

Convolutional neural networks (CNNs) have been widely applied in many deep learning applications. In recent years, the FPGA implementation for CNNs has attracted much attention because of its high performance and energy efficiency. However, existing implementations have difficulty to fully leverage the computation power of the latest FPGAs. In this paper we implement CNN on an FPGA using a systolic array architecture, which can achieve high clock frequency under high resource utilization. We provide an analytical model for performance and resource utilization and develop an automatic design space exploration framework, as well as source-to-source code transformation from a C program to a CNN implementation using systolic array. The experimental results show that our framework is able to generate the accelerator for real-life CNN models, achieving up to 461 GFlops for floating point data type and 1.2 Tops for 8–16 bit fixed point.

[1]  S. Winograd Arithmetic complexity of computations , 1980 .

[2]  Viktor Prasanna,et al.  Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System , 2017, FPGA.

[3]  Andrew C. Ling,et al.  An OpenCL™ Deep Learning Accelerator on Arria 10 , 2017, FPGA.

[4]  David L. Kuck,et al.  The Structure of Computers and Computations , 1978 .

[5]  Srihari Cadambi,et al.  A Massively Parallel Coprocessor for Convolutional Neural Networks , 2009, 2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors.

[6]  Roger D. Chamberlain,et al.  Design of throughput-optimized arrays from recurrence abstractions , 2010, ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors.

[7]  Jason Cong,et al.  Customizable and High Performance Matrix Multiplication Kernel on FPGA (Abstract Only) , 2015, FPGA.

[8]  Yann LeCun,et al.  CNP: An FPGA-based processor for Convolutional Networks , 2009, 2009 International Conference on Field Programmable Logic and Applications.

[9]  Jason Cong,et al.  Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks , 2016, 2016 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).

[10]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[11]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[12]  Henk Corporaal,et al.  Memory-centric accelerator design for Convolutional Neural Networks , 2013, 2013 IEEE 31st International Conference on Computer Design (ICCD).

[13]  Yu Cao,et al.  Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks , 2017, FPGA.

[14]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[15]  Yu Cao,et al.  Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks , 2016, FPGA.

[16]  Srihari Cadambi,et al.  A programmable parallel accelerator for learning and classification , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[17]  Christos-Savvas Bouganis,et al.  fpgaConvNet: A Framework for Mapping Convolutional Neural Networks on FPGAs , 2016, 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM).

[18]  Jason Cong,et al.  Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks , 2015, FPGA.

[19]  Jing Li,et al.  Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural Network , 2017, FPGA.

[20]  Shengen Yan,et al.  Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs , 2017, 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM).

[21]  Sven Verdoolaege,et al.  isl: An Integer Set Library for the Polyhedral Model , 2010, ICMS.

[22]  Srihari Cadambi,et al.  A dynamically configurable coprocessor for convolutional neural networks , 2010, ISCA.

[23]  Yu Wang,et al.  Going Deeper with Embedded FPGA Platform for Convolutional Neural Network , 2016, FPGA.

[24]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).