Rethinking NoCs for spatial neural network accelerators

Applications across image processing, speech recognition, and classification rely heavily on neural-network-based algorithms, which have demonstrated highly promising accuracy. However, such algorithms involve massive amounts of computation that general-purpose processors cannot handle efficiently. To cope with this challenge, spatial-architecture-based accelerators, which consist of arrays of hundreds of processing elements (PEs), have emerged. These accelerators achieve high throughput by exploiting massively parallel computation across the PEs; however, most of them neglect on-chip data-movement overhead, which grows with the degree of computational parallelism, and employ primitive networks-on-chip (NoCs) such as buses, crossbars, and meshes. Such NoCs work for general-purpose multicores but, as this work demonstrates, lack the scalability in area, power, latency, and throughput required inside accelerators. To this end, we propose a novel NoC generator that produces networks tailored to the traffic flows within a neural network, namely scatter, gather, and local communication, thereby facilitating accelerator design. We build our NoCs from arrays of extremely lightweight microswitches that are more energy- and area-efficient than traditional on-chip routers. We evaluate the performance, area, and energy of our microswitch-based networks for convolutional neural network accelerators.
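To make the three traffic classes concrete, below is a minimal behavioral sketch in Python of a microswitch array serving scatter (one-to-many distribution of weights or inputs), gather (many-to-one collection of partial results), and local (neighbor-to-neighbor) communication. All names here (MicroSwitch, MicroSwitchArray, and their methods) are illustrative assumptions for exposition, not the paper's actual generator or RTL.

```python
# Hypothetical behavioral model of a 1-D microswitch array. A real
# microswitch is a tiny, statically configured circuit with no routing
# tables or deep buffering; this sketch only mimics its data movement.

from dataclasses import dataclass


@dataclass
class MicroSwitch:
    """One lightweight switch, holding the value last delivered to its PE."""
    index: int
    local_value: object = None


class MicroSwitchArray:
    def __init__(self, num_pes: int):
        self.switches = [MicroSwitch(i) for i in range(num_pes)]

    def scatter(self, value, targets):
        """One-to-many: broadcast a value (e.g., filter weights) to a set of PEs."""
        for i in targets:
            self.switches[i].local_value = value

    def gather(self, sources):
        """Many-to-one: collect results (e.g., partial sums) from a set of PEs."""
        return [self.switches[i].local_value for i in sources]

    def local(self, src: int, dst: int):
        """Neighbor-to-neighbor forwarding (e.g., sliding-window input reuse)."""
        assert abs(src - dst) == 1, "local traffic is between adjacent PEs"
        self.switches[dst].local_value = self.switches[src].local_value


if __name__ == "__main__":
    noc = MicroSwitchArray(num_pes=8)
    noc.scatter("w0", targets=range(8))   # distribute one weight to all PEs
    noc.local(0, 1)                       # pass a value to the neighboring PE
    print(noc.gather(sources=[0, 1, 2]))  # collect results from three PEs
```

Because each pattern is fixed and known ahead of time, a generator can wire the switches for exactly these flows instead of provisioning general-purpose routing, which is where the claimed area and energy savings over buses, crossbars, and meshes come from.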
