论文信息 - Accelerating HotSpots in Deep Neural Networks on a CAPI-Based FPGA

Accelerating HotSpots in Deep Neural Networks on a CAPI-Based FPGA

This paper introduces a new energy-efficient FPGA accelerator targeting the hotspots in Deep Neural Network (DNN) applications. Our design leverages the Coherent Accelerator Processor Interface (CAPI) which provides a coherent view of system memory to attached accelerators. Our implementation bypasses the need for device driver code and significantly reduces the communication and I/O overhead. Performance is further improved by a tiling transformation that exploits data locality in the computation kernel via the CAPI Power Service Layer (PSL) cache. A new adder tree configuration is proposed which achieves a tunable balance between resource utilization and power consumption. An implementation on a CAPI-supported Kintex FPGA board achieves up to 155 GOPs/s and 15.79 GOPs/watt, improving on the state-of-the-art of FPGA-based DNN implementations.

Apan Qasem | Semih Aslan | Md Syadus Sefat | Jeffrey W Kellington

[1] Kai Yu,et al. Large-scale deep learning at Baidu , 2013, CIKM.

[2] Kenli Li,et al. A Parallel Random Forest Algorithm for Big Data in a Spark Cloud Computing Environment , 2017, IEEE Transactions on Parallel and Distributed Systems.

[3] Song Han,et al. EIE: Efficient Inference Engine on Compressed Deep Neural Network , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[4] Ninghui Sun,et al. DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning , 2014, ASPLOS.

[5] Asit K. Mishra,et al. From high-level deep neural models to FPGAs , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[6] Damien Lyonnard,et al. Parallel programming models for a multiprocessor SoC platform applied to networking and multimedia , 2006, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[7] Weisong Shi,et al. Edge Computing: Vision and Challenges , 2016, IEEE Internet of Things Journal.

[8] Junzhong Shen,et al. FPGA‐accelerated deep convolutional neural networks for high throughput and energy efficiency , 2017, Concurr. Comput. Pract. Exp..

[9] Jeffrey Stuecheli,et al. CAPI: A Coherent Accelerator Processor Interface , 2015, IBM J. Res. Dev..

[10] Jian Sun,et al. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[11] Marc'Aurelio Ranzato,et al. Multi-GPU Training of ConvNets , 2013, ICLR.

[12] Heiner Giefers,et al. Accelerating arithmetic kernels with coherent attached FPGA coprocessors , 2015, 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[13] Jason Cong,et al. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks , 2015, FPGA.

[14] Tao Wang,et al. Deep learning with COTS HPC systems , 2013, ICML.

[15] Yangqing Jia,et al. Learning Semantic Image Representations at a Large Scale , 2014 .

[16] Eriko Nurvitadhi,et al. Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Neural Networks? , 2017, FPGA.

[17] Srihari Cadambi,et al. A dynamically configurable coprocessor for convolutional neural networks , 2010, ISCA.

[18] Yu Wang,et al. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network , 2016, FPGA.

[19] Eriko Nurvitadhi,et al. Accelerating recurrent neural networks in analytics servers: Comparison of FPGA, CPU, GPU, and ASIC , 2016, 2016 26th International Conference on Field Programmable Logic and Applications (FPL).

[20] Yu Cao,et al. Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks , 2016, FPGA.

[21] Trevor Darrell,et al. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[22] Vivienne Sze,et al. Efficient Processing of Deep Neural Networks: A Tutorial and Survey , 2017, Proceedings of the IEEE.

[23] Berin Martini,et al. A 240 G-ops/s Mobile Coprocessor for Deep Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[24] Ming Yang,et al. DeepFace: Closing the Gap to Human-Level Performance in Face Verification , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[25] Enrique S. Quintana-Ortí,et al. Accelerating the Lyapack library using GPUs , 2013, The Journal of Supercomputing.

[26] Moriyoshi Ohara,et al. A power-efficient FPGA accelerator: Systolic array with cache-coherent interface for pair-HMM algorithm , 2016, 2016 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS XIX).

[27] Jason Cong,et al. Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks , 2016, 2016 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).

[28] Zelong Wang,et al. Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA , 2018, FPGA.