Flexibility: FPGAs and CAD in Deep Learning Acceleration

Deep learning inference has become the key workload to accelerate in our AI-powered world. FPGAs are an ideal platform for accelerating deep learning inference, combining low-latency performance, power efficiency, and flexibility. This paper examines the flexibility aspect and its impact on FPGA design methodology, physical design tools, and CAD. We describe the degrees of flexibility required for creating efficient deep learning accelerators. We quantify the varying effects of precision, vectorization, and buffering on both performance and accuracy, and show how the FPGA can yield superior performance through architecture customization tuned to a specific neural network. We describe the need for abstraction and propose solutions in modern FPGA design flows to enable the rapid creation of these customized accelerator architectures for deep learning inference. Finally, we examine the implications for physical design tools and CAD.
