USCA: A Unified Systolic Convolution Array Architecture for Accelerating Sparse Neural Networks

Due to intensive computational complexity and the variety of convolution types, it is challenging to implement different CNN models on a single hardware platform. Many previous works focus on data reuse and sparsity exploitation to accelerate computation, but fail to support various types of convolution efficiently. When dealing with variants of conventional convolution, such as deconvolution or dilated convolution, previous accelerators waste time padding zeros and then convolving over the padded feature maps. In this paper, we propose a unified convolution algorithm that combines several convolution types and exploits the sparsity in activations; the proposed algorithm skips the padding process entirely. Moreover, a unified systolic convolution array (USCA) architecture is developed based on the algorithm. The USCA architecture is implemented in a TSMC 28 nm CMOS technology. Implementation results show that the architecture requires 206k logic gates and 114.7 kB of on-chip memory. It reaches a peak performance of 374.7 GOPS and consumes 201.1 mW at a frequency of 1449 MHz. Compared to similar works, the USCA architecture achieves 3x higher energy efficiency, measured in GOPS per watt. Moreover, to the best of our knowledge, USCA is the first architecture that can simultaneously support conventional convolution, deconvolution, and dilated convolution in an efficient manner.
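To make the padding-skipping idea concrete, the following minimal NumPy sketch shows, for the 1-D single-channel case, how dilated convolution and deconvolution can be computed with index arithmetic instead of materializing padded or zero-inserted feature maps, while also skipping zero activations. This is an illustrative sketch under those simplifying assumptions, not the paper's USCA dataflow; the function names dilated_conv1d and deconv1d are hypothetical.

import numpy as np

def dilated_conv1d(x, w, dilation=1):
    # Dilated convolution without building a zero-padded (dilated)
    # kernel: the input is simply sampled at the dilation stride.
    K = len(w)
    out_len = len(x) - dilation * (K - 1)
    y = np.zeros(out_len)
    for i in range(out_len):
        for k in range(K):
            a = x[i + k * dilation]
            if a != 0.0:              # skip zero activations (sparsity)
                y[i] += a * w[k]
    return y

def deconv1d(x, w, stride=2):
    # Transposed convolution (deconvolution) by scattering each input
    # sample into the output, instead of zero-inserting the input and
    # running a dense convolution over it.
    K = len(w)
    out_len = stride * (len(x) - 1) + K
    y = np.zeros(out_len)
    for i, a in enumerate(x):
        if a == 0.0:                  # zero activations contribute nothing
            continue
        for k in range(K):
            y[stride * i + k] += a * w[k]
    return y

w = np.array([1.0, -1.0, 0.5])
print(dilated_conv1d(np.array([1.0, 0.0, 2.0, 0.0, 3.0]), w, dilation=2))
# -> [0.5]
print(deconv1d(np.array([1.0, 2.0]), w, stride=2))
# -> [ 1.  -1.   2.5 -2.   1. ]

The same index arithmetic extends to 2-D feature maps, and because it reduces all three convolution variants to shifted multiply-accumulate patterns, it is the kind of computation that maps naturally onto a systolic array's address generation.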
