NEURAghe

Deep convolutional neural networks (CNNs) achieve outstanding results in tasks that require human-level understanding of data, such as image and speech recognition. However, their computational load is significant, motivating the development of CNN-specialized accelerators. This work presents NEURAghe, a flexible and efficient hardware/software solution for the acceleration of CNNs on Zynq SoCs. NEURAghe leverages the synergistic use of the Zynq ARM cores and of a powerful and flexible Convolution-Specific Processor deployed on the reconfigurable logic. The Convolution-Specific Processor embeds both a convolution engine and a programmable soft core, releasing the ARM processors from most supervision duties and allowing the accelerator to be controlled by software at an ultra-fine granularity. This methodology opens the way for cooperative heterogeneous computing: while the accelerator takes care of the bulk of the CNN workload, the ARM cores can seamlessly execute hard-to-accelerate parts of the computational graph, taking advantage of the NEON vector engines to further speed up computation. Through the companion NeuDNN software stack, NEURAghe supports end-to-end CNN-based classification with a peak performance of 169 GOps/s and an energy efficiency of 17 GOps/W. Thanks to this heterogeneous computing model, the platform improves on the state of the art, achieving a frame rate of 5.5 frames per second (fps) on the end-to-end execution of VGG-16 and 6.6 fps on ResNet-18.
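To make the cooperative heterogeneous computing model concrete, the minimal C sketch below shows how a host-side dispatch loop could split a CNN graph between the accelerator and the CPU: convolutional layers go to the Convolution-Specific Processor in the programmable logic, while hard-to-accelerate layers run on the ARM cores. It assumes only what the abstract states; every identifier here (layer_t, csp_run_conv, arm_run_layer) is a hypothetical stand-in rather than the actual NeuDNN API, and the accelerator calls are stubbed out with prints.

/*
 * Cooperative heterogeneous execution on a Zynq-class SoC: a sketch.
 * All function and type names are illustrative, not the real NeuDNN API.
 */
#include <stdio.h>

typedef enum { LAYER_CONV, LAYER_FC, LAYER_SOFTMAX } layer_kind_t;

typedef struct {
    layer_kind_t kind;
    const char  *name;
} layer_t;

/* Hypothetical: enqueue a convolutional layer on the Convolution-Specific
 * Processor; its embedded soft core supervises the convolution engine, so
 * the ARM side only needs to issue the command. */
static void csp_run_conv(const layer_t *l)
{
    printf("[CSP] conv layer %s offloaded to programmable logic\n", l->name);
}

/* Hypothetical: execute a hard-to-accelerate layer on the ARM cores,
 * where NEON vector instructions can be used to speed it up. */
static void arm_run_layer(const layer_t *l)
{
    printf("[ARM] layer %s executed on the host cores\n", l->name);
}

int main(void)
{
    /* Toy computational graph: convolutions dominate the workload and the
     * tail (fully connected, softmax) stays on the host. */
    const layer_t graph[] = {
        { LAYER_CONV,    "conv1" },
        { LAYER_CONV,    "conv2" },
        { LAYER_FC,      "fc6"   },
        { LAYER_SOFTMAX, "prob"  },
    };

    for (size_t i = 0; i < sizeof graph / sizeof graph[0]; i++) {
        if (graph[i].kind == LAYER_CONV)
            csp_run_conv(&graph[i]);   /* bulk of the GOps -> accelerator */
        else
            arm_run_layer(&graph[i]);  /* remainder -> ARM cores (NEON)  */
    }
    return 0;
}

Because the soft core inside the Convolution-Specific Processor handles supervision of the convolution engine, in the real system the host side of such a loop mostly enqueues work, leaving the ARM cores free to overlap their own layers with the accelerator's convolutions.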
