Low-Overhead Network-on-Chip Support for Location-Oblivious Task Placement

Many-core processors will have many processing cores with a network-on-chip (NoC) that provides access to shared resources such as main memory and on-chip caches. However, locally-fair arbitration in multi-stage NoC can lead to globally unfair access to shared resources and impact system-level performance depending on where each task is physically placed. In this work, we propose an arbitration to provide equality-of-service (EoS) in the network and provide support for location-oblivious task placement. We propose using probabilistic arbitration combined with distance-based weights to achieve EoS and overcome the limitation of round-robin arbiter. However, the complexity of probabilistic arbitration results in high area and long latency which negatively impacts performance. In order to reduce the hardware complexity, we propose an hybrid arbiter that switches between a simple arbiter at low load and a complex arbiter at high load. The hybrid arbiter is enabled by the observation that arbitration only impacts the overall performance and global fairness at a high load. We evaluate our arbitration scheme with synthetic traffic patterns and GPGPU benchmarks. Our results shows that hybrid arbiter that combines round-robin arbiter with probabilistic distance-based arbitration reduces performance variation as task placement is varied and also improves average IPC.

[1]  Deborah K. Weisser,et al.  Age-based packet arbitration in large-radix k-ary n-cubes , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).

[2]  Scott Shenker,et al.  Analysis and simulation of a fair queueing algorithm , 1989, SIGCOMM 1989.

[3]  Onur Mutlu,et al.  Preemptive Virtual Clock: A flexible, efficient, and cost-effective QOS scheme for networks-on-chip , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[4]  Nam Sung Kim,et al.  The case for GPGPU spatial multitasking , 2012, IEEE International Symposium on High-Performance Comp Architecture.

[5]  Ran Ginosar,et al.  QNoC: QoS architecture and design process for network on chip , 2004, J. Syst. Archit..

[6]  Norman P. Jouppi,et al.  Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0 , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[7]  William J. Dally,et al.  Allocator implementations for network-on-chip routers , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[8]  John Kim,et al.  Probabilistic Distance-Based Arbitration: Providing Equality of Service for Many-Core CMPs , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[9]  Henry Wong,et al.  Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[10]  Alexandra Fedorova,et al.  Addressing shared resource contention in multicore processors via scheduling , 2010, ASPLOS XV.

[11]  Yuan Xie,et al.  LOFT: A High Performance Network-on-Chip Providing Quality-of-Service Support , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[12]  Natalie D. Enright Jerger,et al.  Achieving predictable performance through better memory controller placement in many-core CMPs , 2009, ISCA '09.

[13]  Krste Asanovic,et al.  Globally-Synchronized Frames for Guaranteed Quality-of-Service in On-Chip Networks , 2008, 2008 International Symposium on Computer Architecture.

[14]  Chita R. Das,et al.  MediaWorm: A QoS Capable Router Architecture for Clusters , 2002, IEEE Trans. Parallel Distributed Syst..

[15]  Reetuparna Das,et al.  Application-to-core mapping policies to reduce memory interference in multi-core systems , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[16]  Lixia Zhang,et al.  Virtual Clock: A New Traffic Control Algorithm for Packet Switching Networks , 1990, SIGCOMM.

[17]  J.H. Kim,et al.  Rotating Combined Queueing (RCQ): Bandwidth and Latency Guarantees in Low-Cost, High-Performance Networks , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[18]  Calvin Lin,et al.  Adaptive History-Based Memory Schedulers , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[19]  Chita R. Das,et al.  Aérgia: exploiting packet latency slack in on-chip networks , 2010, ISCA.

[20]  Anujan Varma,et al.  Design and analysis of frame-based fair queueing: a new traffic scheduling algorithm for packet-switched networks , 1996, SIGMETRICS '96.

[21]  William E. Weihl,et al.  Lottery scheduling: flexible proportional-share resource management , 1994, OSDI '94.

[22]  Niraj K. Jha,et al.  A 4.6Tbits/s 3.6GHz single-cycle NoC router with a novel switch allocator in 65nm CMOS , 2007, ICCD.

[23]  Kees Goossens,et al.  AEthereal network on chip: concepts, architectures, and implementations , 2005, IEEE Design & Test of Computers.

[24]  Wen-mei W. Hwu,et al.  Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing , 2012 .

[25]  Gu-Yeon Wei,et al.  Process Variation Tolerant 3T1D-Based Cache Architectures , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[26]  William J. Dally,et al.  Principles and Practices of Interconnection Networks , 2004 .

[27]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[28]  Nan Jiang,et al.  A detailed and flexible cycle-accurate Network-on-Chip simulator , 2013, 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[29]  W. Dally,et al.  Route packets, not wires: on-chip interconnection networks , 2001, Proceedings of the 38th Design Automation Conference (IEEE Cat. No.01CH37232).

[30]  Wolf-Dietrich Weber,et al.  A quality-of-service mechanism for interconnection networks in system-on-chips , 2005, Design, Automation and Test in Europe.

[31]  Alexandra Fedorova,et al.  AKULA: A toolset for experimenting and developing thread placement algorithms on multicore systems , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[32]  Ganesh Lakshminarayana,et al.  LOTTERYBUS: a new high-performance communication architecture for system-on-chip designs , 2001, DAC '01.

[33]  A. Kumary,et al.  A 4.6Tbits/s 3.6GHz single-cycle NoC router with a novel switch allocator in 65nm CMOS , 2007 .