A Gradient-Interleaved Scheduler for Energy-Efficient Backpropagation for Training Neural Networks

This paper addresses the design of systolic-array accelerators for training neural networks using a novel gradient-interleaving approach. Training a neural network involves backpropagating the error and computing gradients with respect to both the activations and the weights. It is shown that the gradient with respect to the activations can be computed using a weight-stationary systolic array, while the gradient with respect to the weights can be computed using an output-stationary systolic array. The novelty of the proposed approach lies in interleaving these two gradient computations on the same configurable systolic array. This allows variables to be reused from one computation to the other and eliminates unnecessary memory accesses. The proposed approach achieves $1.4\times$ - $2.2\times$ savings in the number of cycles and $1.9\times$ savings in memory accesses. Thus, the proposed accelerator reduces both latency and energy consumption.
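To make the two interleaved computations concrete, the minimal NumPy sketch below shows the arithmetic for a single fully connected layer: the activation gradient is a matrix-vector product suited to a weight-stationary array, and the weight gradient is an outer product suited to an output-stationary array. The names (W, a_prev, z, delta_out, act_deriv) and shapes are illustrative assumptions; this is only the underlying math, not the paper's hardware or scheduler.

import numpy as np

def layer_gradients(W, a_prev, z, delta_out, act_deriv):
    # Error at the pre-activation of this layer; this term is shared by
    # both gradients, which is the variable reuse that interleaving exploits.
    grad_z = delta_out * act_deriv(z)
    # Gradient w.r.t. the input activations: a matrix-vector product
    # W^T * grad_z, which maps onto a weight-stationary systolic array.
    delta_in = W.T @ grad_z
    # Gradient w.r.t. the weights: an outer product grad_z * a_prev^T,
    # which maps onto an output-stationary systolic array.
    dW = np.outer(grad_z, a_prev)
    return delta_in, dW

# Illustrative usage with a ReLU layer (shapes chosen arbitrarily).
W = np.random.randn(4, 3)
a_prev = np.random.randn(3)
z = W @ a_prev
delta_out = np.random.randn(4)
relu_deriv = lambda x: (x > 0).astype(x.dtype)
delta_in, dW = layer_gradients(W, a_prev, z, delta_out, relu_deriv)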
