3D NoC-enabled heterogeneous manycore architectures for accelerating CNN training: Performance and thermal trade-offs

As deep learning is increasingly employed in diverse application domains, the demand for computational power to run these algorithms also increases. In this respect, high-performance three-dimensional (3D) heterogeneous manycore systems present a promising direction. However, deep learning on these systems poses several design challenges. First, the network-on-chip (NoC) must handle the traffic requirements of both CPU and GPU communications. Second, 3D system designs must address the thermal issues that result from high power density. In this work, we propose a design methodology for a heterogeneous 3D NoC architecture that not only satisfies the traffic requirements of both CPUs and GPUs, but also reduces thermal hotspots. To this end, we target the training of two widely employed convolutional neural networks (CNNs), namely LeNet and CIFAR. By using our joint performance-thermal optimization methodology to create a 3D NoC for training CNNs, we reduce the maximum temperature by 22% while incurring only 5% full-system energy-delay-product degradation compared with a solely performance-optimized 3D NoC. This demonstrates that our design methodology achieves considerable temperature reduction with negligible loss in performance.
