Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs

Convolutional neural networks (CNNs) are the current state-of-the-art for many computer vision tasks. CNNs outperform older methods in accuracy, but require vast amounts of computation and memory. As a result, existing CNN applications are typically run on clusters of CPUs or GPUs. Studies of FPGA acceleration of CNN workloads have achieved reductions in power and energy consumption. However, large GPUs outperform modern FPGAs in throughput, and the existence of compatible deep learning frameworks gives GPUs a significant advantage in programmability. Recent research in machine learning demonstrates the potential of very low precision CNNs -- i.e., CNNs with binarized weights and activations. Such binarized neural networks (BNNs) appear well suited for FPGA implementation, as their dominant computations are bitwise logic operations and their memory requirements are reduced. A combination of low-precision networks and high-level design methodology may help address the performance and productivity gap between FPGAs and GPUs. In this paper, we present the design of a BNN accelerator that is synthesized from C++ to FPGA-targeted Verilog. The accelerator outperforms existing FPGA-based CNN accelerators in GOPS as well as energy and resource efficiency.
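To illustrate why binarized layers map well onto FPGA fabric, the minimal C++ sketch below shows a binarized dot product reduced to XNOR and population count. It is not the paper's actual kernel; the 64-bit word packing, the +1/-1-to-bit encoding, and the function name are assumptions made for this example.

```cpp
// Minimal sketch of a binarized dot product (assumed encoding: bit 1 = +1,
// bit 0 = -1, with 64 weights/activations packed per machine word).
// This is an illustration, not the accelerator's synthesized kernel.
#include <bit>       // std::popcount (C++20)
#include <cstddef>
#include <cstdint>

// Dot product of two {-1, +1} vectors of length 64 * n_words.
// XNOR marks positions where weight and activation agree; popcount counts
// them, and (agreements - disagreements) = 2 * popcount - 64 per word.
int binary_dot(const uint64_t* w, const uint64_t* a, std::size_t n_words) {
    int acc = 0;
    for (std::size_t i = 0; i < n_words; ++i) {
        uint64_t agree = ~(w[i] ^ a[i]);           // bitwise XNOR
        acc += 2 * std::popcount(agree) - 64;      // map bit counts back to +/-1 sums
    }
    return acc;
}
```

In an HLS flow such as the one described here, a loop of this form can be fully unrolled and the popcount mapped onto LUTs rather than DSP blocks, which is one reason BNNs sidestep the multiply-accumulate cost that dominates fixed-point CNN accelerators.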
