FPAP: A Folded Architecture for Efficient Computing of Convolutional Neural Networks

Convolutional neural networks (CNNs) have found extensive applications in practice. However, the sparsity of weights and activations, together with the different data-precision requirements across layers, leads to a large number of redundant computations. In this paper, we propose an efficient architecture for CNNs, named the Folded Precision-Adjustable Processor (FPAP), which skips these unnecessary computations with ease. Computations are folded in two respects to achieve efficient computing. On one hand, the dominant multiply-and-accumulate (MAC) operations are performed bit-serially based on a bit-pair encoding algorithm, so that the FPAP can adapt to different numerical precisions without using multipliers of long data width. On the other hand, a 1-D convolution is carried out by a multi-tap transposed finite impulse response (FIR) filter that is folded into a single tap, so that computations involving zero activations and zero weights can be easily skipped. Equipped with the precision-adjustable MAC unit and the folded FIR filter structure, a well-designed array architecture consisting of many identical processing elements is developed; it is scalable to different throughput requirements and highly flexible with respect to numerical precision. In addition, a novel genetic-algorithm-based kernel reallocation scheme is introduced to mitigate the load-imbalance issue. Our synthesis results demonstrate that the proposed FPAP significantly reduces the logic complexity and the critical path compared with the corresponding unfolded design, which delivers only slightly higher throughput when processing sparse and compact models. Our experiments also show that the FPAP scales its energy efficiency from 1.01 TOP/s/W to 6.26 TOP/s/W in 90 nm CMOS technology as different data precisions are used.
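As a rough illustration of the two folding ideas summarized above, the following Python sketch models a bit-serial MAC that consumes the weight two bits at a time and a 1-D convolution evaluated on a single folded tap that visits only nonzero operand pairs. The function names (bitpair_mac, folded_sparse_conv1d) and the specific two's-complement bit-pair interpretation are illustrative assumptions; the abstract does not spell out the exact FPAP encoding or datapath, and the hardware transposed-FIR dataflow is collapsed into a plain software loop here.

    import numpy as np

    def bitpair_mac(activation, weight, weight_bits=8):
        """Multiply one activation by one weight bit-serially, two weight bits
        (one bit pair) per step, accumulating shifted partial products.
        Weights are treated as signed two's-complement integers.
        NOTE: an assumed, simplified stand-in for the paper's bit-pair encoding."""
        acc = 0
        w = int(weight) & ((1 << weight_bits) - 1)   # raw two's-complement pattern
        for step in range(weight_bits // 2):
            pair = (w >> (2 * step)) & 0b11          # current 2-bit slice
            if step == weight_bits // 2 - 1:
                # The most significant pair carries a negative weight in
                # two's complement: value = b1 * (-2) + b0 * (+1).
                msb, lsb = (pair >> 1) & 1, pair & 1
                pair_val = -2 * msb + lsb
            else:
                pair_val = pair                      # 0..3
            if pair_val != 0:                        # skip all-zero bit pairs
                acc += (pair_val * activation) << (2 * step)
        return acc

    def folded_sparse_conv1d(x, w, weight_bits=8):
        """1-D CNN-style convolution (cross-correlation) evaluated on a single
        'folded' tap: only nonzero (activation, weight) pairs are fed to the
        bit-serial MAC, so operations on zero operands are skipped."""
        n, k = len(x), len(w)
        y = np.zeros(n - k + 1, dtype=np.int64)
        for i in range(len(y)):
            acc = 0
            for t in range(k):                       # one folded tap visits all k positions in turn
                a, c = x[i + t], w[t]
                if a != 0 and c != 0:                # zero-skipping
                    acc += bitpair_mac(a, c, weight_bits)
            y[i] = acc
        return y

    # Usage check against a dense reference computation.
    x = np.random.randint(-8, 8, size=32)
    x[np.random.rand(32) < 0.5] = 0                  # sparse activations
    w = np.array([3, 0, -5, 0, 7], dtype=np.int64)   # sparse 5-tap kernel
    assert np.array_equal(folded_sparse_conv1d(x, w),
                          np.correlate(x, w, mode='valid'))

The zero checks in the inner loop mark the computations the folded tap can skip; in the actual hardware, skipping a zero operand pair would save an entire bit-serial pass rather than a single software multiply, and all-zero bit pairs of a nonzero weight save individual cycles.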
