Hybrid network-on-chip architectures for accelerating deep learning kernels on heterogeneous manycore platforms

In recent years, designing specialized manycore heterogeneous architectures for deep learning kernels has become an area of great interest. However, the typical on-chip communication infrastructures employed on conventional manycore platforms are unable to handle both CPU and GPU communication requirements efficiently. Hence, in this paper, our aim is to enhance the performance of heterogeneous manycore architectures through the design of a hybrid NoC consisting of both wireline and wireless links. To this end, we specifically target the resource-intensive backpropagation algorithm commonly used as the training method in deep learning. For backpropagation, the proposed hybrid NoC achieves 1.9× reduction in network latency and improves the network throughput by a factor of 2 with respect to a highly optimized mesh NoC. These network level improvements translate into 25% savings in full system energy-delay-product (EDP). This demonstrates the capability of the proposed hybrid and heterogeneous manycore architecture in accelerating deep learning kernels in an energy-efficient manner.

[1]  Nam Sung Kim,et al.  GPUWattch: enabling energy optimizations in GPGPUs , 2013, ISCA.

[2]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[3]  Tao Wang,et al.  Deep learning with COTS HPC systems , 2013, ICML.

[4]  Radu Marculescu,et al.  "It's a small world after all": NoC performance optimization via long-range link insertion , 2006, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[5]  Olav Lysne,et al.  Layered routing in irregular networks , 2006, IEEE Transactions on Parallel and Distributed Systems.

[6]  David A. Wood,et al.  GPU Computing Pipeline Inefficiencies and Optimization Opportunities in Heterogeneous CPU-GPU Processors , 2015, 2015 IEEE International Symposium on Workload Characterization.

[7]  Jia Wang,et al.  DaDianNao: A Machine-Learning Supercomputer , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[8]  Terrence Mak,et al.  A Survey of Emerging Interconnects for On-Chip Efficient Multicast and Broadcast in Many-Cores , 2016, IEEE Circuits and Systems Magazine.

[9]  Partha Pratim Pande,et al.  Wireless NoC as Interconnection Backbone for Multicore Chips: Promises and Challenges , 2012, IEEE Journal on Emerging and Selected Topics in Circuits and Systems.

[10]  Ujjwal Maulik,et al.  A Simulated Annealing-Based Multiobjective Optimization Algorithm: AMOSA , 2008, IEEE Transactions on Evolutionary Computation.

[11]  Partha Pratim Pande,et al.  Design of an Energy-Efficient CMOS-Compatible NoC Architecture with Millimeter-Wave Wireless Interconnects , 2013, IEEE Transactions on Computers.

[12]  Andrew R. Barron,et al.  Universal approximation bounds for superpositions of a sigmoidal function , 1993, IEEE Trans. Inf. Theory.

[13]  Demis Hassabis,et al.  Mastering the game of Go with deep neural networks and tree search , 2016, Nature.

[14]  Partha Pratim Pande,et al.  Enhancing performance of wireless NoCs with distributed MAC protocols , 2015, Sixteenth International Symposium on Quality Electronic Design.

[15]  Guigang Zhang,et al.  Deep Learning , 2016, Int. J. Semantic Comput..

[16]  David R. Kaeli,et al.  Asymmetric NoC Architectures for GPU Systems , 2015, NOCS.

[17]  David A. Wood,et al.  Heterogeneous system coherence for integrated CPU-GPU systems , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[18]  Yue Ping Zhang,et al.  Propagation Mechanisms of Radio Waves Over Intra-Chip Channels With Integrated Antennas: Frequency-Domain Measurements and Time-Domain Analysis , 2007, IEEE Transactions on Antennas and Propagation.

[19]  Masoud Daneshtalab,et al.  Reconfigurable communication fabric for efficient implementation of neural networks , 2015, 2015 10th International Symposium on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC).

[20]  Indrani Paul,et al.  Achieving Exascale Capabilities through Heterogeneous Computing , 2015, IEEE Micro.

[21]  Geoffrey E. Hinton,et al.  Learning representations by back-propagating errors , 1986, Nature.

[22]  Niraj K. Jha,et al.  GARNET: A detailed on-chip network model inside a full-system simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[23]  Klaus Kofler,et al.  Performance and Scalability of GPU-Based Convolutional Neural Networks , 2010, 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing.

[24]  Partha Pratim Pande,et al.  Design Space Exploration for Wireless NoCs Incorporating Irregular Network Routing , 2014, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[25]  David A. Wood,et al.  gem5-gpu: A Heterogeneous CPU-GPU Simulator , 2015, IEEE Computer Architecture Letters.

[26]  John Kim,et al.  Throughput-Effective On-Chip Networks for Manycore Accelerators , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[27]  Sudhakar Yalamanchili,et al.  Design space exploration of on-chip ring interconnection for a CPU-GPU heterogeneous architecture , 2013, J. Parallel Distributed Comput..

[28]  Wim Bogaerts,et al.  Design Challenges in Silicon Photonics , 2014, IEEE Journal of Selected Topics in Quantum Electronics.

[29]  A. Sugavanam,et al.  Wireless communication in a flip-chip package using integrated antennas on silicon substrates , 2005, IEEE Electron Device Letters.

[30]  Sudhakar Yalamanchili,et al.  Adaptive virtual channel partitioning for network-on-chip in heterogeneous architectures , 2013, ACM Trans. Design Autom. Electr. Syst..

[31]  Mahmut T. Kandemir,et al.  Managing GPU Concurrency in Heterogeneous Architectures , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[32]  Jim D. Garside,et al.  SpiNNaker: A 1-W 18-Core System-on-Chip for Massively-Parallel Neural Network Simulation , 2013, IEEE Journal of Solid-State Circuits.

[33]  Jinchun Kim,et al.  Bandwidth-efficient on-chip interconnect designs for GPGPUs , 2015, 2015 52nd ACM/EDAC/IEEE Design Automation Conference (DAC).

[34]  Ran Ginosar,et al.  Network-on-Chip Architectures for Neural Networks , 2010, 2010 Fourth ACM/IEEE International Symposium on Networks-on-Chip.

[35]  Yu Su,et al.  Communication Using Antennas Fabricated in Silicon Integrated Circuits , 2007, IEEE Journal of Solid-State Circuits.