Accelerating HotSpots in Deep Neural Networks on a CAPI-Based FPGA

This paper introduces an energy-efficient FPGA accelerator that targets the hotspots in Deep Neural Network (DNN) applications. The design builds on the Coherent Accelerator Processor Interface (CAPI), which gives attached accelerators a coherent view of system memory. As a result, the implementation needs no device driver code and significantly reduces communication and I/O overhead. Performance is further improved by a tiling transformation that exploits data locality in the computation kernel through the cache of the CAPI POWER Service Layer (PSL). We also propose a new adder tree configuration that offers a tunable balance between resource utilization and power consumption. An implementation on a CAPI-supported Kintex FPGA board achieves up to 155 GOPS and 15.79 GOPS/W, improving on the state of the art in FPGA-based DNN implementations.
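To make the two techniques named in the abstract concrete, the following is a minimal software sketch, not the paper's hardware design. It assumes the hotspot is a dense matrix multiplication, and the function names `matmul_tiled` and `adder_tree_sum` and the `tile` parameter are hypothetical. The tiled loop nest works on one small block at a time so that the block can stay resident in a cache, and the pairwise reduction mirrors a plain binary adder tree. The tunable adder tree configuration the paper proposes is not described in the abstract and is not reproduced here.

```python
def matmul_tiled(a, b, tile=4):
    """Multiply a (n x k) by b (k x m), visiting the operands tile by tile.

    `tile` is a hypothetical block size; in hardware it would be chosen
    so that one block of each operand fits in the accelerator-side cache.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):          # tile row of the result
        for j0 in range(0, m, tile):      # tile column of the result
            for p0 in range(0, k, tile):  # tile along the reduction axis
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


def adder_tree_sum(values):
    """Sum values pairwise, level by level, like a binary adder tree."""
    level = list(values)
    if not level:
        return 0
    while len(level) > 1:
        # Each pass models one tree level: adjacent pairs are added.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # an odd element passes through unchanged
        level = nxt
    return level[0]
```

The tile size changes only the order in which elements are visited, not the result, so any positive `tile` gives the same product as the untiled loop nest. For N inputs, the pairwise reduction finishes in about log2(N) levels.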
