A Pipelined and Scalable Dataflow Implementation of Convolutional Neural Networks on FPGA

Convolutional Neural Network (CNN) is a deep learning algorithm extended from Artificial Neural Network (ANN) and widely used for image classification and recognition, thanks to its invariance to distortions. The recent rapid growth of applications based on deep learning algorithms, especially in the context of Big Data analytics, has dramatically improved both industrial and academic research and exploration of optimized implementations of CNNs on accelerators such as GPUs, FPGAs and ASICs, as general purpose processors can hardly meet the ever increasing performance and energy-efficiency requirements. FPGAs in particular are one of the most attractive alternative, as they allow the exploitation of the implicit parallelism of the algorithm and the acceleration of the different layers of a CNN with custom optimizations, while retaining extreme flexibility thanks to their reconfigurability. In this work, we propose a methodology to implement CNNs on FPGAs in a modular, scalable way. This is done by exploiting the dataflow pattern of convolutions, using an approach derived from previous work on the acceleration of Iterative Stencil Loops (ISLs), a computational pattern that shares some characteristics with convolutions. Furthermore, this approach allows the imple- mentation of a high-level pipeline between the different network layers, resulting in an increase of the overall performance when the CNN is employed to process batches of multiple images, as it would happen in real-life scenarios.

[1]  Marco D. Santambrogio,et al.  A polyhedral model-based framework for dataflow implementation on FPGA devices of Iterative Stencil Loops , 2016, 2016 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).

[2]  Srihari Cadambi,et al.  A dynamically configurable coprocessor for convolutional neural networks , 2010, ISCA.

[3]  Yu Wang,et al.  Going Deeper with Embedded FPGA Platform for Convolutional Neural Network , 2016, FPGA.

[4]  Klaus Kofler,et al.  Performance and Scalability of GPU-Based Convolutional Neural Networks , 2010, 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing.

[5]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  Tao Wang,et al.  Deep learning with COTS HPC systems , 2013, ICML.

[7]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[8]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[9]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[10]  Karin Strauss,et al.  Accelerating Deep Convolutional Neural Networks Using Specialized Hardware , 2015 .

[11]  Yoshua Bengio,et al.  An empirical evaluation of deep architectures on problems with many factors of variation , 2007, ICML '07.

[12]  Yann LeCun,et al.  Convolutional networks and applications in vision , 2010, Proceedings of 2010 IEEE International Symposium on Circuits and Systems.

[13]  Samuel Williams,et al.  Roofline: an insightful visual performance model for multicore architectures , 2009, CACM.

[14]  Luca Maria Gambardella,et al.  Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence Flexible, High Performance Convolutional Neural Networks for Image Classification , 2022 .

[15]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[16]  Jason Cong,et al.  Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks , 2015, FPGA.

[17]  Henk Corporaal,et al.  Memory-centric accelerator design for Convolutional Neural Networks , 2013, 2013 IEEE 31st International Conference on Computer Design (ICCD).

[18]  Marco D. Santambrogio,et al.  On the Automation of High Level Synthesis of Convolutional Neural Networks , 2016, 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).

[19]  Marco D. Santambrogio,et al.  On How to Accelerate Iterative Stencil Loops , 2015, ACM Trans. Archit. Code Optim..

[20]  Kai Yu,et al.  Large-scale deep learning at Baidu , 2013, CIKM.

[21]  Yann LeCun,et al.  CNP: An FPGA-based processor for Convolutional Networks , 2009, 2009 International Conference on Field Programmable Logic and Applications.

[22]  Srihari Cadambi,et al.  A Massively Parallel Coprocessor for Convolutional Neural Networks , 2009, 2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors.

[23]  Jason Cong,et al.  Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks , 2016, 2016 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).

[24]  Srihari Cadambi,et al.  A programmable parallel accelerator for learning and classification , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[25]  Jia Wang,et al.  DaDianNao: A Machine-Learning Supercomputer , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.