NEURAghe

Deep convolutional neural networks (CNNs) achieve outstanding results in tasks that require human-level understanding of data, such as image and speech recognition. However, their computational load is significant, motivating the development of CNN-specialized accelerators. This work presents NEURAghe, a flexible and efficient hardware/software solution for the acceleration of CNNs on Zynq SoCs. NEURAghe leverages the synergistic use of the Zynq ARM cores and of a powerful and flexible Convolution-Specific Processor deployed on the reconfigurable logic. The Convolution-Specific Processor embeds both a convolution engine and a programmable soft core, releasing the ARM processors from most supervision duties and allowing the accelerator to be controlled by software at an ultra-fine granularity. This methodology opens the way for cooperative heterogeneous computing: while the accelerator takes care of the bulk of the CNN workload, the ARM cores can seamlessly execute hard-to-accelerate parts of the computational graph, taking advantage of the NEON vector engines to further speed up computation. Through the companion NeuDNN software stack, NEURAghe supports end-to-end CNN-based classification with a peak performance of 169 GOps/s and an energy efficiency of 17 GOps/W. Thanks to this heterogeneous computing model, the platform improves on the state of the art, achieving a frame rate of 5.5 frames per second (fps) on the end-to-end execution of VGG-16 and 6.6 fps on ResNet-18.
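To make the cooperative heterogeneous computing model concrete, the minimal C sketch below shows how a host-side dispatch loop could split a CNN graph between the accelerator and the CPU: convolutional layers go to the Convolution-Specific Processor in the programmable logic, while hard-to-accelerate layers run on the ARM cores. It assumes only what the abstract states; every identifier here (layer_t, csp_run_conv, arm_run_layer) is a hypothetical stand-in rather than the actual NeuDNN API, and the accelerator calls are stubbed out with prints.

/*
 * Cooperative heterogeneous execution on a Zynq-class SoC: a sketch.
 * All function and type names are illustrative, not the real NeuDNN API.
 */
#include <stdio.h>

typedef enum { LAYER_CONV, LAYER_FC, LAYER_SOFTMAX } layer_kind_t;

typedef struct {
    layer_kind_t kind;
    const char  *name;
} layer_t;

/* Hypothetical: enqueue a convolutional layer on the Convolution-Specific
 * Processor; its embedded soft core supervises the convolution engine, so
 * the ARM side only needs to issue the command. */
static void csp_run_conv(const layer_t *l)
{
    printf("[CSP] conv layer %s offloaded to programmable logic\n", l->name);
}

/* Hypothetical: execute a hard-to-accelerate layer on the ARM cores,
 * where NEON vector instructions can be used to speed it up. */
static void arm_run_layer(const layer_t *l)
{
    printf("[ARM] layer %s executed on the host cores\n", l->name);
}

int main(void)
{
    /* Toy computational graph: convolutions dominate the workload and the
     * tail (fully connected, softmax) stays on the host. */
    const layer_t graph[] = {
        { LAYER_CONV,    "conv1" },
        { LAYER_CONV,    "conv2" },
        { LAYER_FC,      "fc6"   },
        { LAYER_SOFTMAX, "prob"  },
    };

    for (size_t i = 0; i < sizeof graph / sizeof graph[0]; i++) {
        if (graph[i].kind == LAYER_CONV)
            csp_run_conv(&graph[i]);   /* bulk of the GOps -> accelerator */
        else
            arm_run_layer(&graph[i]);  /* remainder -> ARM cores (NEON)  */
    }
    return 0;
}

Because the soft core inside the Convolution-Specific Processor handles supervision of the convolution engine, in the real system the host side of such a loop mostly enqueues work, leaving the ARM cores free to overlap their own layers with the accelerator's convolutions.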
