Fairness-oriented and location-aware NUCA for many-core SoC

Non-uniform cache architecture (NUCA) is often used to organize the last-level cache (LLC) over a Network-on-Chip (NoC). However, as the network size of Systems-on-Chip (SoC) scales up, two trends emerge. First, network latency is becoming the major source of cache access latency. Second, the gap in communication distance and latency between different cores is increasing. This gap can cause a serious network latency imbalance problem, aggravate the non-uniformity of cache access latencies, and thereby degrade system performance. In this paper, we propose a novel NUCA-based scheme, named fairness-oriented and location-aware NUCA (FL-NUCA), to alleviate the network latency imbalance problem and achieve more uniform cache access. We strive to equalize network latencies, which are measured by three metrics: average latency (AL), latency standard deviation (LSD), and maximum latency (ML). In FL-NUCA, both the memory-to-LLC mapping and the links are non-uniformly distributed to better fit the network topology and traffic, thereby equalizing non-contention latencies and contention latencies, respectively. Experimental results show that FL-NUCA effectively improves the fairness of network latencies. Compared with traditional static NUCA (S-NUCA), in simulation with synthetic traffic, the average improvements in AL, LSD, and ML are 20.9%, 36.3%, and 35.0%, respectively. In simulation with PARSEC benchmarks, the average improvements in AL, LSD, and ML are 6.3%, 3.6%, and 11.2%, respectively.
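
To make the three metrics concrete, the following is a minimal sketch, not the paper's evaluation method. It assumes a k x k 2D mesh with one LLC bank per tile, a uniform memory-to-LLC mapping (every core accesses every bank equally often, as a stand-in for S-NUCA), and the Manhattan hop count as a proxy for non-contention latency. It also assumes the metrics are taken over per-core average latencies. The function names `hop_latencies` and `fairness_metrics` are hypothetical.

```python
from statistics import pstdev

def hop_latencies(k):
    """Per-core average hop count to all banks in a k x k mesh.

    Assumption: one LLC bank per tile and uniform access to every bank,
    with Manhattan distance standing in for non-contention latency.
    """
    tiles = [(x, y) for x in range(k) for y in range(k)]
    per_core = []
    for cx, cy in tiles:
        hops = [abs(cx - bx) + abs(cy - by) for bx, by in tiles]
        per_core.append(sum(hops) / len(hops))
    return per_core

def fairness_metrics(latencies):
    """Return AL (mean), LSD (population std. dev.), and ML (maximum)."""
    return {
        "AL": sum(latencies) / len(latencies),
        "LSD": pstdev(latencies),
        "ML": max(latencies),
    }

metrics = fairness_metrics(hop_latencies(4))
print(metrics)
```

Under these assumptions, corner cores see a higher average hop count than central cores, which is the kind of imbalance that a nonzero LSD and the distance between ML and AL are meant to capture.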
