Design Space Exploration of On-chip Ring Interconnection for a CPU-GPU Architecture

Future chip multiprocessors (CMPs) will continue to grow in core count and in diversity of frequency, power consumption, and resource distribution. Integrating GPU cores, which are more efficient for certain types of applications, into a CMP is the next stage in this trend. Such a heterogeneous mix of cores will use an on-chip interconnection to access shared resources such as last-level cache tiles and memory controllers, and the configuration of this on-chip network will likely have a significant impact on resource distribution, fairness, and overall performance. Heterogeneity inevitably exerts different pressures on the interconnection, because applications running on CPU and GPU cores have different characteristics and requirements. The root cause is the difference in thread-level parallelism between the two architectures. GPUs run many threads to hide memory latency but require massive bandwidth to supply those threads. CPU cores, in contrast, typically run only one or two threads concurrently and are therefore very sensitive to latency. This study surveys the behavior and impact of the interconnection network when CPU and GPGPU applications run simultaneously. Its findings are intended to inform further interconnection studies on CPU-GPU heterogeneous architectures.
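
The latency/bandwidth contrast described above can be illustrated with Little's law (throughput = requests in flight / latency). The sketch below is not taken from the study. The function name `sustained_bandwidth` and every number in it are hypothetical and chosen only to show the trend, and each thread is assumed to keep a fixed number of memory requests outstanding.

```python
def sustained_bandwidth(outstanding_requests: int,
                        request_bytes: int,
                        latency_cycles: float) -> float:
    """Bytes per cycle a core can consume, by Little's law.

    With a fixed number of requests in flight, throughput equals
    (requests in flight * bytes per request) / round-trip latency.
    """
    if latency_cycles <= 0:
        raise ValueError("latency_cycles must be positive")
    return outstanding_requests * request_bytes / latency_cycles


# Hypothetical values, for illustration only (not measurements).
LINE_BYTES = 64

# A CPU core with few threads keeps only a couple of requests in flight,
# so its throughput tracks latency directly: doubling latency halves it.
cpu_fast = sustained_bandwidth(2, LINE_BYTES, 100)
cpu_slow = sustained_bandwidth(2, LINE_BYTES, 200)

# A GPU core with many threads keeps hundreds of requests in flight,
# tolerating a longer latency but demanding far more bytes per cycle.
gpu = sustained_bandwidth(512, LINE_BYTES, 400)

print(f"CPU @100 cycles: {cpu_fast:.2f} B/cycle")
print(f"CPU @200 cycles: {cpu_slow:.2f} B/cycle")
print(f"GPU @400 cycles: {gpu:.2f} B/cycle")
```

Under these assumed numbers the GPU core tolerates four times the CPU's latency yet needs far more interconnect bandwidth, while the CPU core's throughput falls in direct proportion to any added latency.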
