F-CNN: An FPGA-based framework for training Convolutional Neural Networks

This paper presents a novel reconfigurable framework for training Convolutional Neural Networks (CNNs). The framework reconfigures a streaming datapath at runtime to cover the training cycle for the various layers in a CNN. The datapath supports parameterized modules that can be customized to produce implementations with different trade-offs between performance and resource usage. All modules follow the same input and output data layout, which simplifies configuration scheduling. For each layer type, module instances contain several computation kernels operating in parallel, which can be customized for different layer configurations and data precisions. Associated models of performance, resource usage and bandwidth are used to derive datapath parameters and to guide the analysis of design trade-offs, so that application requirements or platform constraints can be met. These models estimate implementation characteristics for a given layer configuration, allowing performance to be maximized under constraints on bandwidth and hardware resources. Experimental results indicate that the proposed module design, targeting Maxeler technology, achieves 62.06 GFLOPS for 32-bit floating-point arithmetic, outperforming existing accelerators. Further evaluation based on training LeNet-5 shows that the proposed framework is about 4 times faster than the CPU implementation of Caffe and about 7.5 times more energy efficient than the GPU implementation of Caffe.
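The role of the performance, resource and bandwidth models can be illustrated with a minimal sketch. It is not the paper's model: the cost figures (`dsp_per_kernel`, one input value read and one output value written per kernel per cycle) and all names are assumptions chosen for illustration. Only the FLOP count of a convolution layer is standard arithmetic. The search picks the largest number of parallel kernels that fits both a resource budget and a bandwidth limit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvLayer:
    in_channels: int
    out_channels: int
    kernel: int        # kernel is kernel x kernel
    out_height: int
    out_width: int

    def forward_flops(self) -> int:
        # One multiply and one add per kernel weight per output pixel.
        return (2 * self.in_channels * self.out_channels
                * self.kernel ** 2 * self.out_height * self.out_width)


def best_parallelism(layer, dsp_budget, dsp_per_kernel, clock_hz,
                     bandwidth_bytes_per_s, bytes_per_value=4):
    """Return (kernels, estimated GFLOPS) for the fastest feasible design.

    Hypothetical model: every kernel finishes one K x K window per cycle.
    """
    flops_per_kernel_cycle = 2 * layer.kernel ** 2
    best = None
    for kernels in range(1, layer.out_channels + 1):
        # Resource constraint (assumed linear cost per kernel).
        if kernels * dsp_per_kernel > dsp_budget:
            break
        # Bandwidth constraint (assumed: one shared input value read,
        # one output value written per kernel, every cycle).
        bytes_per_cycle = (1 + kernels) * bytes_per_value
        if bytes_per_cycle * clock_hz > bandwidth_bytes_per_s:
            break
        gflops = kernels * flops_per_kernel_cycle * clock_hz / 1e9
        best = (kernels, gflops)
    return best


# First convolution layer of LeNet-5: 1 input map, 6 output maps,
# 5 x 5 kernels, 28 x 28 outputs.
c1 = ConvLayer(in_channels=1, out_channels=6, kernel=5,
               out_height=28, out_width=28)
choice = best_parallelism(c1, dsp_budget=100, dsp_per_kernel=25,
                          clock_hz=100e6, bandwidth_bytes_per_s=10e9)
```

With these made-up budgets the resource limit is reached before the bandwidth limit, so the search stops at four parallel kernels. Changing either budget shifts which constraint binds, which is the kind of trade-off the abstract says the models are meant to expose.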
