Towards Efficient Deep Neural Network Training by FPGA-Based Batch-Level Parallelism

Training Deep Neural Networks (DNNs) requires a significant amount of time and resources to obtain acceptable results, which severely limits deployment on resource-limited platforms. This paper proposes DarkFPGA, a novel customizable framework to efficiently accelerate the entire DNN training process on a single FPGA platform. First, we explore batch-level parallelism to enable efficient training on FPGAs. Second, we devise a novel hardware architecture optimised by a batch-oriented data pattern and tiling techniques to effectively exploit this parallelism. Moreover, an analytical model is developed to determine the optimal design parameters for the DarkFPGA accelerator with respect to a specific network specification and FPGA resource constraints. Our results show that, when training VGG-like networks on the CIFAR dataset with 8-bit integers, the accelerator on the Maxeler MAX5 platform runs about 11 times faster than CPU training and consumes about a third of the energy of GPU training.
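To make the combination of batch-level tiling and the analytical model concrete, the following Python sketch illustrates the general idea: enumerate candidate batch-tile and channel-tile sizes, discard configurations that exceed the FPGA resource budget, and keep the one with the best estimated throughput. This is an illustration only, not the DarkFPGA implementation; the tile names (Tb, Tc), resource budgets, and cost/throughput formulas are assumed placeholders.

```python
# Hypothetical sketch: batch/channel tiling plus a simple analytical model
# that selects design parameters under FPGA resource constraints.
# Resource costs and the throughput estimate are assumed, not taken from DarkFPGA.

from itertools import product

# Assumed FPGA budgets (placeholder numbers, not MAX5-specific).
DSP_BUDGET = 5760        # available DSP slices
BRAM_BUDGET_KB = 32000   # available on-chip memory in KB

def resource_usage(Tb, Tc):
    """Estimate resources for a tile of Tb images x Tc channels:
    one MAC lane per (batch, channel) pair, buffers scaling with tile size."""
    dsps = Tb * Tc                       # one multiplier per parallel lane
    bram_kb = 4 * Tb * Tc + 64           # assumed buffer cost model
    return dsps, bram_kb

def throughput_estimate(Tb, Tc, batch=128, channels=512):
    """Relative throughput: fewer tile passes over the batch and channel
    dimensions means higher throughput (idealised, ignores memory stalls)."""
    passes = -(-batch // Tb) * -(-channels // Tc)   # ceiling division
    return 1.0 / passes

best = None
for Tb, Tc in product([1, 2, 4, 8, 16, 32, 64], repeat=2):
    dsps, bram = resource_usage(Tb, Tc)
    if dsps > DSP_BUDGET or bram > BRAM_BUDGET_KB:
        continue                                    # violates resource constraints
    score = throughput_estimate(Tb, Tc)
    if best is None or score > best[0]:
        best = (score, Tb, Tc)

print("chosen tile (batch, channel):", best[1:], "relative throughput:", best[0])
```

A real design-space exploration would also model off-chip bandwidth and the tiling of the backward and weight-gradient passes, but the structure is the same: an analytical cost model evaluated over a small space of tile sizes, constrained by the device's resources.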
