A Novel 1D-Convolution Accelerator for Low-Power Real-Time CNN Processing on the Edge

With the rise of deep learning, the demand for real-time edge intelligence is greater than ever. Current algorithm and hardware realizations often focus on the cloud paradigm and assume that entire frames of data are available in large batches. As a result, real-time AI inference at the edge has remained a challenging goal, due to tight latency requirements as well as the streaming nature of the data. There is an inherent need for novel architectures that can realize latency-aware, agile deep learning algorithms at the edge. This paper introduces a novel joint algorithm-architecture approach to enable real-time, low-power Convolutional Neural Network (CNN) processing on edge devices. The core of the proposed approach is the use of 1D convolution together with an architecture that can truly benefit from this algorithmic optimization. On the algorithm side, we present a novel training and inference method based on 1D convolution. On the architecture side, we present a novel dataflow architecture capable of performing on-the-fly 1D convolution over the pixel stream. Our results on a Xilinx Zynq-7000 FPGA for SqueezeNet demonstrate only a 2% loss in accuracy while maintaining real-time processing at 60 frames per second with only 1.73 W of power consumption. The dynamic power consumption is 7.3X lower than that of a regular 2D-convolution CNN at the same frame rate, and 4.3X lower than the total power of an Nvidia Jetson TX2 delivering only 30 frames per second.
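The decomposition at the heart of this line of work can be illustrated with a small example. The sketch below is a minimal NumPy illustration of the general separable-filter idea, not the paper's accelerator or training method: a rank-1 k x k kernel is applied as a vertical 1D pass followed by a horizontal 1D pass, cutting the multiply-accumulates per output pixel from k*k to 2k. The function names and sample filters are hypothetical.

```python
# Minimal sketch (not the paper's implementation) of replacing one 2D
# convolution with two 1D convolutions for a separable (rank-1) kernel.
import numpy as np

def conv2d_valid(image, kernel):
    """Plain 2D 'valid' cross-correlation: k*k MACs per output pixel."""
    k = kernel.shape[0]
    h, w = image.shape
    out = np.zeros((h - k + 1, w - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

def conv2d_separable(image, col, row):
    """Same result via two 1D passes: 2k MACs per output pixel."""
    h, w = image.shape
    k = col.size
    # Vertical 1D pass: filter each column with `col`.
    tmp = np.zeros((h - k + 1, w))
    for i in range(h - k + 1):
        tmp[i, :] = col @ image[i:i + k, :]
    # Horizontal 1D pass: filter each row of the intermediate with `row`.
    out = np.zeros((h - k + 1, w - k + 1))
    for j in range(w - k + 1):
        out[:, j] = tmp[:, j:j + k] @ row
    return out

# Hypothetical 1D filter pair; their outer product is a separable 3x3 kernel.
col = np.array([1.0, 2.0, 1.0])
row = np.array([1.0, 0.0, -1.0])
kernel = np.outer(col, row)          # rank-1 by construction, hence separable
image = np.random.rand(8, 8)
assert np.allclose(conv2d_valid(image, kernel),
                   conv2d_separable(image, col, row))
```

Because each 1D pass touches only a few neighboring pixels at a time, it maps naturally onto a line-buffered streaming datapath, which is what makes on-the-fly convolution over a pixel stream feasible in hardware.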
