Paper Information - CaffePresso: Accelerating Convolutional Networks on Embedded SoCs

Auto-tuning and parametric implementation of deep learning kernels allow off-the-shelf accelerator-based embedded platforms to deliver high-performance and energy-efficient mappings of the inference phase of lightweight neural networks. Low-complexity classifiers are characterized by operations on small image maps with two to three deep layers and few class labels. For these use cases, we consider a range of embedded systems with 20 W power budgets such as the Xilinx ZC706 (FPGA), NVIDIA Jetson TX1 (GPU), TI Keystone II (DSP), and Adapteva Parallella (RISC+NoC). In CaffePresso, we combine auto-tuning of the implementation parameters with platform-specific constraints to deliver optimized solutions for each input ConvNet specification.

Nachiket Kapre | Gopalakrishna Hegde | Siddhartha
