Paper Information - Learning-Based Application-Agnostic 3D NoC Design for Heterogeneous Manycore Systems

Learning-Based Application-Agnostic 3D NoC Design for Heterogeneous Manycore Systems

The rising use of deep learning and other big-data algorithms has increased the demand for hardware platforms that are computationally powerful yet energy-efficient. Given the amount of data parallelism in these algorithms, high-performance three-dimensional (3D) manycore platforms that incorporate both CPUs and GPUs present a promising direction. However, as systems use heterogeneity (e.g., a combination of CPUs, GPUs, and accelerators) to improve performance and efficiency, it becomes more important to address the distinct and likely conflicting communication requirements (e.g., CPU memory access latency or GPU network throughput) that arise from such heterogeneity. Unfortunately, it is difficult to quickly explore the hardware design space and choose appropriate tradeoffs between these heterogeneous requirements. To address these challenges, we propose the design of a 3D Network-on-Chip (NoC) for heterogeneous manycore platforms that considers the appropriate design objectives for a 3D heterogeneous system and explores various tradeoffs using an efficient machine learning (ML)-based multi-objective optimization (MOO) technique. The proposed design space exploration considers the various requirements of the system's heterogeneous components and generates a set of 3D NoC architectures that efficiently trade off these design objectives. Our findings show that by jointly considering these requirements (latency, throughput, temperature, and energy), we can achieve a 9.6 percent better energy-delay product (EDP) on average at nearly iso-temperature conditions when compared to a thermally-optimized design for 3D heterogeneous NoCs. More importantly, our results suggest that our 3D NoCs optimized for a few applications can be generalized to unknown applications as well. These generalized 3D NoCs incur only a 1.8 percent (36-tile system) and 1.1 percent (64-tile system) average performance loss compared to application-specific NoCs.
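To make the trade-off idea in the abstract concrete, the following minimal Python sketch shows what a set of designs that trade off latency, throughput, temperature, and energy might look like as a Pareto front, and how an energy-delay product could then rank the surviving candidates. It is not the paper's ML-based MOO technique. The `Design` record, the candidate values, and the use of latency as the delay term in `edp` are all assumptions made for illustration only.

```python
from typing import List, NamedTuple, Tuple


class Design(NamedTuple):
    """Hypothetical NoC candidate; all values are made-up, unitless numbers."""
    name: str
    latency: float      # lower is better
    throughput: float   # higher is better
    temperature: float  # lower is better
    energy: float       # lower is better


def costs(d: Design) -> Tuple[float, float, float, float]:
    # Negate throughput so that every objective is minimized.
    return (d.latency, -d.throughput, d.temperature, d.energy)


def dominates(a: Design, b: Design) -> bool:
    # a dominates b if it is no worse in every objective and strictly better in one.
    ca, cb = costs(a), costs(b)
    no_worse = all(x <= y for x, y in zip(ca, cb))
    strictly_better = any(x < y for x, y in zip(ca, cb))
    return no_worse and strictly_better


def pareto_front(designs: List[Design]) -> List[Design]:
    # Keep only the designs that no other design dominates.
    return [d for d in designs if not any(dominates(o, d) for o in designs)]


def edp(d: Design) -> float:
    # Assumption: latency stands in for the delay term of the energy-delay product.
    return d.energy * d.latency


# Illustrative candidates (assumed values, not results from the paper).
DESIGNS = [
    Design("A", latency=10.0, throughput=5.0, temperature=80.0, energy=2.0),
    Design("B", latency=12.0, throughput=6.0, temperature=75.0, energy=2.5),
    Design("C", latency=13.0, throughput=5.0, temperature=82.0, energy=3.0),
]

front = pareto_front(DESIGNS)
best = min(front, key=edp)
print([d.name for d in front], best.name)
```

In this toy data, design C is worse than A in latency, temperature, and energy and no better in throughput, so it drops out. A and B remain because each beats the other on at least one objective. A real design-space exploration would search a far larger space and would not enumerate every candidate as this sketch does.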
