NoC-based DNN accelerator: a future design paradigm

Deep Neural Networks (DNN) have shown significant advantages in many domains such as pattern recognition, prediction, and control optimization. The edge computing demand in the Internet-of-Things era has motivated many kinds of computing platforms to accelerate the DNN operations. The most common platforms are CPU, GPU, ASIC, and FPGA. However, these platforms suffer from low performance (i.e., CPU and GPU), large power consumption (i.e., CPU, GPU, ASIC, and FPGA), or low computational flexibility at runtime (i.e., FPGA and ASIC). In this paper, we suggest the NoC-based DNN platform as a new accelerator design paradigm. The NoC-based designs can reduce the off-chip memory accesses through a flexible interconnect that facilitates data exchange between processing elements on the chip. We first comprehensively investigate conventional platforms and methodologies used in DNN computing. Then we study and analyze different design parameters to implement the NoC-based DNN accelerator. The presented accelerator is based on mesh topology, neuron clustering, random mapping, and XY-routing. The experimental results on LeNet, MobileNet, and VGG-16 models show the benefits of the NoC-based DNN accelerator in reducing off-chip memory accesses and improving runtime computational flexibility.

[1]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[2]  Masoud Daneshtalab,et al.  EbDa: A new theory on design and verification of deadlock-free interconnection networks , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[3]  Lawrence D. Jackel,et al.  Handwritten Digit Recognition with a Back-Propagation Network , 1989, NIPS.

[4]  Jim D. Garside,et al.  SpiNNaker: A 1-W 18-Core System-on-Chip for Massively-Parallel Neural Network Simulation , 2013, IEEE Journal of Solid-State Circuits.

[5]  Yann LeCun,et al.  An FPGA-based stream processor for embedded real-time vision with Convolutional Networks , 2009, 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops.

[6]  Aline. Vieira-de-Mello ATLAS-An Environment for NoC Generation and Evaluation , 2011 .

[7]  Nan Jiang,et al.  A detailed and flexible cycle-accurate Network-on-Chip simulator , 2013, 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[8]  Sven Behnke,et al.  Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition , 2010, ICANN.

[9]  Liam McDaid,et al.  Scalable Hierarchical Network-on-Chip Architecture for Spiking Neural Network Hardware Implementations , 2013, IEEE Transactions on Parallel and Distributed Systems.

[10]  Vincent Vanhoucke,et al.  Improving the speed of neural networks on CPUs , 2011 .

[11]  Christopher R'e,et al.  Caffe con Troll: Shallow Ideas to Speed Up Deep Learning , 2015, DanaC@SIGMOD.

[12]  Toru Baji,et al.  Evolution of the GPU Device widely used in AI and Massive Parallel Processing , 2018, 2018 IEEE 2nd Electron Devices Technology and Manufacturing Conference (EDTM).

[13]  Tianshi Chen,et al.  ShiDianNao: Shifting vision processing closer to the sensor , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[14]  Marian Verhelst,et al.  A 0.3–2.6 TOPS/W precision-scalable processor for real-time large-scale ConvNets , 2016, 2016 IEEE Symposium on VLSI Circuits (VLSI-Circuits).

[15]  Hyoukjun Kwon,et al.  Rethinking NoCs for spatial neural network accelerators , 2017, 2017 Eleventh IEEE/ACM International Symposium on Networks-on-Chip (NOCS).

[16]  Masoud Daneshtalab,et al.  CuPAN - High Throughput On-chip Interconnection for Neural Networks , 2015, ICONIP.

[17]  Ning Li,et al.  A statistic approach for power analysis of integrated GPU , 2019, Soft Comput..

[18]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Masoud Daneshtalab,et al.  Reconfigurable communication fabric for efficient implementation of neural networks , 2015, 2015 10th International Symposium on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC).

[20]  Hadi Esmaeilzadeh,et al.  Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Network , 2017, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA).

[21]  Ninghui Sun,et al.  DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning , 2014, ASPLOS.

[22]  Qiong Wu,et al.  Testing aware dynamic mapping for path-centric network-on-chip test , 2019, Integr..

[23]  Leibo Liu,et al.  A 1.06-to-5.09 TOPS/W reconfigurable hybrid-neural-network processor for deep learning applications , 2017, 2017 Symposium on VLSI Circuits.

[24]  Tianshi Chen,et al.  DaDianNao: A Neural Network Supercomputer , 2017, IEEE Transactions on Computers.

[25]  Efraim Rotem,et al.  Inside 6th-Generation Intel Core: New Microarchitecture Code-Named Skylake , 2017, IEEE Micro.

[26]  Hoi-Jun Yoo,et al.  UNPU: An Energy-Efficient Deep Neural Network Accelerator With Fully Variable Weight Bit Precision , 2019, IEEE Journal of Solid-State Circuits.

[27]  Vivienne Sze,et al.  Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices , 2018, IEEE Journal on Emerging and Selected Topics in Circuits and Systems.

[28]  David A. Patterson,et al.  In-datacenter performance analysis of a tensor processing unit , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[29]  Yiran Chen,et al.  Neu-NoC: A high-efficient interconnection network for accelerated neuromorphic systems , 2018, 2018 23rd Asia and South Pacific Design Automation Conference (ASP-DAC).

[30]  Vivienne Sze,et al.  Eyeriss v2: A Flexible and High-Performance Accelerator for Emerging Deep Neural Networks , 2018, ArXiv.

[31]  Song Han,et al.  Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding , 2015, ICLR.

[32]  Masoud Daneshtalab,et al.  Routing Algorithms in Networks-on-Chip , 2013 .

[33]  Luis Ceze,et al.  Neural Acceleration for General-Purpose Approximate Programs , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[34]  Ce Liu,et al.  Deep Convolutional Neural Network for Image Deconvolution , 2014, NIPS.

[35]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[36]  Salvatore Monteleone,et al.  Cycle-Accurate Network on Chip Simulation with Noxim , 2016, ACM Trans. Model. Comput. Simul..

[37]  Kun-Chih Jimmy Chen,et al.  NN-Noxim: High-Level Cycle-Accurate NoC-based Neural Networks Simulator , 2018, 2018 11th International Workshop on Network on Chip Architectures (NoCArc).

[38]  Kun-Chih Jimmy Chen,et al.  Cycle-Accurate NoC-based Convolutional Neural Network Simulator , 2019, COINS.

[39]  Leibo Liu,et al.  A High Energy Efficient Reconfigurable Hybrid Neural Network Processor for Deep Learning Applications , 2018, IEEE Journal of Solid-State Circuits.

[40]  Pritish Narayanan,et al.  Deep Learning with Limited Numerical Precision , 2015, ICML.

[41]  Bo Chen,et al.  MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications , 2017, ArXiv.

[42]  Luca P. Carloni,et al.  NoC-Based Support of Heterogeneous Cache-Coherence Models for Accelerators , 2018, 2018 Twelfth IEEE/ACM International Symposium on Networks-on-Chip (NOCS).

[43]  Zhenyu Liu,et al.  High-Performance FPGA-Based CNN Accelerator With Block-Floating-Point Arithmetic , 2019, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[44]  Yang Wang,et al.  BigDL: A Distributed Deep Learning Framework for Big Data , 2018, SoCC.

[45]  Liam McDaid,et al.  Scalable Networks-on-Chip Interconnected Architecture for Astrocyte-Neuron Networks , 2016, IEEE Transactions on Circuits and Systems I: Regular Papers.

[46]  Qiang Li,et al.  A lifetime-aware mapping algorithm to extend MTTF of Networks-on-Chip , 2018, 2018 23rd Asia and South Pacific Design Automation Conference (ASP-DAC).

[47]  Hannu Tenhunen,et al.  HARAQ: Congestion-Aware Learning Model for Highly Adaptive Routing Algorithm in On-Chip Networks , 2012, 2012 IEEE/ACM Sixth International Symposium on Networks-on-Chip.

[48]  Xuehai Zhou,et al.  PuDianNao: A Polyvalent Machine Learning Accelerator , 2015, ASPLOS.