A Novel Network Fabric for Efficient Spatio-Temporal Reduction in Flexible DNN Accelerators

The increasing deployment of Deep Neural Networks (DNNs) in a myriad of applications has recently fueled interest in accelerator architectures capable of meeting their stringent performance and energy-consumption requirements. DNN accelerators employ three separate NoCs, namely the distribution, multiplier and reduction networks (DN, MN and RN, respectively), between the global buffer(s) and the compute units (multipliers/adders). These NoCs deliver data and, more importantly, enable on-chip reuse of operands and outputs to minimize expensive off-chip memory accesses. Among them, the RN, used to generate and reduce the partial sums produced during DNN processing, accounts for the largest fraction of chip area (25% of the total chip area in some cases) and power dissipation (38% of the total chip power budget), making it a first-order driver of the accelerator's energy efficiency. RNs can be orchestrated to exploit a Temporal, Spatial or Spatio-Temporal reduction dataflow, of which the latter has shown superior performance. However, as we demonstrate in this work, a state-of-the-art implementation of the Spatio-Temporal reduction dataflow, based on adding Accumulators (Ac) to the RN (the RN+Ac strategy), can incur significant area and energy costs. To address this issue, we propose STIFT (Spatio-Temporal Integrated Folding Tree), which implements the Spatio-Temporal reduction dataflow entirely on the RN hardware substrate, i.e., without the extra accumulators. STIFT yields significant area and power savings over the more complex RN+Ac strategy while preserving its performance advantage.
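
To make the distinction between the three reduction dataflows concrete, the sketch below models the partial sums produced by the multiplier network as a plain list and reduces them spatially (an adder tree), temporally (a single accumulator updated over cycles), and spatio-temporally (a folded tree whose per-slice results are accumulated over cycles). This is a minimal illustration of the dataflow concepts named above, not code from the paper; the function names and the tree_width parameter are assumptions for illustration only.

```python
# Illustrative sketch (not from the paper): the three reduction dataflows
# applied to a list of partial sums produced by the multiplier network.

def spatial_reduction(partial_sums):
    """Reduce all partial sums through a binary adder tree:
    one tree level per step, log2(N) levels for N inputs."""
    level = list(partial_sums)
    while len(level) > 1:
        # Pair up adjacent values, as adjacent adders in a tree level would.
        pairs = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:            # odd element passes through unchanged
            pairs.append(level[-1])
        level = pairs
    return level[0]

def temporal_reduction(partial_sums):
    """Accumulate partial sums one per cycle into a single accumulator."""
    acc = 0
    for ps in partial_sums:           # one addition per cycle
        acc += ps
    return acc

def spatio_temporal_reduction(partial_sums, tree_width):
    """Folded reduction: each cycle a tree_width-wide adder tree reduces one
    slice spatially, and the slice results are accumulated temporally."""
    acc = 0
    for i in range(0, len(partial_sums), tree_width):
        acc += spatial_reduction(partial_sums[i:i + tree_width])
    return acc

if __name__ == "__main__":
    ps = list(range(1, 17))           # 16 partial sums
    assert (spatial_reduction(ps)
            == temporal_reduction(ps)
            == spatio_temporal_reduction(ps, tree_width=4)
            == sum(ps))
```

All three functions compute the same result; what differs is how the additions are distributed between hardware adders (space) and cycles (time), which is exactly the trade-off the RN organization determines.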
