FPAP: A Folded Architecture for Energy-Quality Scalable Convolutional Neural Networks

Emerging convolutional neural networks (CNNs) tend to be designed with varied per-layer data widths and sparse representations. However, these two features, both of which introduce many redundant computations, have not been exploited simultaneously in existing hardware architectures for CNNs. This paper proposes an energy-quality scalable architecture, namely the folded precision-adjustable processor (FPAP), which eliminates these computational redundancies through folding techniques. On one hand, FPAP decomposes the dominant multiply-accumulate (MAC) operations into multiple adds and folds them into a single arithmetic unit; only the effective adds (or a subset of them) are then computed serially. FPAP can therefore adapt to different per-layer data widths and enable precision-adjustable approximate computing. In particular, FPAP adaptively selects either the activation or the weight to be decomposed in each MAC so as to minimize the total number of adds and clock cycles. On the other hand, a 1-D convolution is carried out by a multi-tap transposed finite impulse response (FIR) filter, which is folded into a single tap to skip MACs with zero weights or activations. In addition, a judicious delay-element remapping scheme and a novel genetic-algorithm-based kernel reallocation scheme are developed to reduce the power consumption of the folded FIR filter and to mitigate the load imbalance caused by irregular sparsity, respectively. With all these optimizations, FPAP achieves processing speed comparable to, or even faster than, the corresponding unfolded design on sparse CNNs while occupying a smaller area. Experimental results on real CNN models demonstrate that FPAP can scale its energy efficiency from 4.28 to 23.63 TOP/s/W and its area efficiency from 37.79 to 164.15 GOP/s/mm² under TSMC 28-nm HPC CMOS technology.
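To make the folding idea concrete, the following is a minimal software sketch (not the FPAP hardware itself, and all function names are illustrative) of how a single MAC can be decomposed into shift-and-add operations over the nonzero bits of one operand, and why choosing the operand with fewer nonzero bits minimizes the number of serial add cycles. Unsigned integers are assumed for simplicity.

    def nonzero_bits(x: int):
        """Return the bit positions where the non-negative integer x has a 1."""
        positions = []
        i = 0
        while x:
            if x & 1:
                positions.append(i)
            x >>= 1
            i += 1
        return positions

    def mac_by_folded_adds(acc: int, activation: int, weight: int) -> int:
        """Accumulate activation * weight into acc using only shifts and adds.

        The operand with fewer nonzero bits is decomposed, so the number of
        serial add cycles equals min(popcount(activation), popcount(weight)).
        """
        a_bits, w_bits = nonzero_bits(activation), nonzero_bits(weight)
        if len(a_bits) <= len(w_bits):
            decomposed, other = a_bits, weight      # decompose the activation
        else:
            decomposed, other = w_bits, activation  # decompose the weight
        for pos in decomposed:                      # one add per effective (nonzero) bit
            acc += other << pos
        return acc

    # Example: 6 (0b110) * 13 (0b1101) -> decompose 6 (2 nonzero bits instead of 3)
    assert mac_by_folded_adds(0, 6, 13) == 6 * 13

In this sketch, truncating the loop over `decomposed` to its most significant positions would correspond to computing only part of the effective adds, i.e., the precision-adjustable approximate-computing mode described above.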
