Paper information - A heterogeneous low-cost and low-latency Ring-Chain network for GPGPUs

A heterogeneous low-cost and low-latency Ring-Chain network for GPGPUs

To achieve high throughput, the core count of compute accelerators such as General-Purpose Graphics Processing Units (GPGPUs) keeps increasing. The communication demand of these cores in turn raises the need for a low-latency packet-switched network. Because packet latency consists mainly of per-hop latency, contention latency and serialization latency, a good Network-on-Chip (NoC) design should reduce all three contributors to meet the communication demand while keeping hardware cost low. In this paper, we first make two observations about how NoCs differ between chip multiprocessors (CMPs) and GPGPUs, and then design a Heterogeneous Ring-Chain network (HRCnet) for the GPGPU reply network. HRCnet eliminates conflicts in the network by adopting a ring-like topology, using a novel node placement and introducing unidirectional channels. Eliminating conflicts reduces the per-hop latency and removes the contention latency, and exploiting the ring-like topology reduces the serialization latency. Experimental results show the benefits of this low-cost, low-latency design. With the same bisection bandwidth as the baseline mesh, HRCnet yields a 45% performance improvement while reducing area by 42% and energy consumption by 60%. Compared to two state-of-the-art GPGPU NoCs, BENoC and DA2mesh, HRCnet achieves more than 42% performance gain at reduced hardware cost. It also achieves the highest power and area efficiency among the compared designs.
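The three latency contributors named in the abstract can be illustrated with a simple additive model. This is a minimal sketch and not the paper's evaluation model: the function name, the parameter names, the rounding of serialization latency to whole cycles, and all example numbers are assumptions made here for illustration. Only the three components themselves come from the abstract.

```python
import math


def packet_latency(hops, per_hop_cycles, contention_cycles,
                   packet_bits, channel_bits):
    """Return packet latency in cycles under an assumed additive model.

    latency = hops * per_hop_cycles      (per-hop latency)
            + contention_cycles          (contention latency)
            + ceil(packet_bits / channel_bits)  (serialization latency)
    """
    per_hop = hops * per_hop_cycles
    serialization = math.ceil(packet_bits / channel_bits)
    return per_hop + contention_cycles + serialization


# Hypothetical numbers: a network with contention and multi-cycle hops.
contended = packet_latency(hops=4, per_hop_cycles=3, contention_cycles=5,
                           packet_bits=512, channel_bits=128)

# Hypothetical conflict-free case: no contention and single-cycle hops.
conflict_free = packet_latency(hops=4, per_hop_cycles=1, contention_cycles=0,
                               packet_bits=512, channel_bits=128)

print(contended, conflict_free)
```

Under these made-up inputs, removing contention and shortening each hop lowers the total, while a wider channel (larger `channel_bits`) would lower the serialization term.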

Lieven Eeckhout | Xia Zhao | Sheng Ma | Zhiying Wang | Chen Li
