Adaptive Cluster Throttling: Improving High-Load Performance in Bufferless On-Chip Networks

Higher core counts and an increasing focus on energy efficiency in modern Chip Multiprocessors (CMPs) have led to renewed interest in simple, energy-efficient Network-on-Chip (NoC) designs. Several recently proposed designs trade off network capacity for efficiency, based on the observation that traditional networks are overprovisioned for many workloads. Bufferless routing is one such example. However, when the application workload requires high interconnect performance, the inefficiencies of bufferless interconnects can cause significant performance degradation. Previous work has tackled various issues with bufferless routing, but little has been done to improve performance at high network load. Fundamental improvements in bufferless network performance at high load could extend the benefits of lower energy and smaller die area to a wider range of applications. In this work, we propose ACT (Adaptive Cluster Throttling), a source-throttling mechanism that provides better system performance and fairness than the best current mechanisms on bufferless networks. By batching applications into clusters and alternately throttling different clusters, ACT gives every application a chance to inject traffic into the network while maintaining control over total network load. Across 60 network-intensive workloads, ACT improves system performance by 11.9% (10.2%) on average and fairness by 14.5% (15.1%) on a 4x4 (8x8) bufferless NoC. At high network load, ACT achieves nearly half the performance gain over a bufferless baseline that a conventional buffered network achieves, while reducing network power by 15.4% (5.4%).
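The clustering idea can be illustrated with a minimal sketch. It is not the ACT implementation: the round-robin cluster assignment, the `throttled_rate` value, and the rule that exactly one cluster injects freely per epoch are all assumptions made for illustration, and the function names are hypothetical.

```python
from typing import Dict, List


def make_clusters(apps: List[str], num_clusters: int) -> List[List[str]]:
    """Deal applications into clusters round-robin (assumed policy)."""
    clusters: List[List[str]] = [[] for _ in range(num_clusters)]
    for i, app in enumerate(apps):
        clusters[i % num_clusters].append(app)
    return clusters


def throttle_rates(
    clusters: List[List[str]], epoch: int, throttled_rate: float = 0.9
) -> Dict[str, float]:
    """Return a per-application throttle rate for the given epoch.

    Assumption: the cluster whose turn it is injects freely (rate 0.0);
    every other cluster is throttled at `throttled_rate`, interpreted
    here as the fraction of injection opportunities that are blocked.
    """
    active = epoch % len(clusters)  # clusters take turns across epochs
    rates: Dict[str, float] = {}
    for idx, cluster in enumerate(clusters):
        for app in cluster:
            rates[app] = 0.0 if idx == active else throttled_rate
    return rates


if __name__ == "__main__":
    clusters = make_clusters(["app0", "app1", "app2", "app3"], num_clusters=2)
    for epoch in range(2):
        print(epoch, throttle_rates(clusters, epoch))
```

Rotating the unthrottled cluster means each application gets a turn at full injection bandwidth over a full rotation, while at any moment most applications are held back, which is one way to bound total network load.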
