A highly robust distributed fault-tolerant routing algorithm for NoCs with localized rerouting

Denser transistor integration has enabled the fabrication of multi-tile chips, however, at the expense of higher susceptibility to defects and wear-out. Metal wires comprising the links of Networks-on-Chip (NoCs) are especially vulnerable to such defects, which can render some links disconnected. This paper presents a new fault-tolerant routing scheme to sustain on-chip communication. It uses a localized re-routing approach, whereby de-touring around faulty links -- or complex regions of faults -- is done locally at each node in a purely distributed and dynamic manner, while guaranteeing deadlock- and livelock-freedom. Results using synthetic traffic and real applications with full-system simulations prove its efficacy in addressing a large percentage of NoC links being faulty albeit at a gracefully degraded performance mode.

[1]  Martin Radetzki,et al.  Fault-tolerant architecture and deflection routing for degradable NoC switches , 2009, 2009 3rd ACM/IEEE International Symposium on Networks-on-Chip.

[2]  Lionel M. Ni,et al.  Fault-tolerant wormhole routing in meshes , 1993, FTCS-23 The Twenty-Third International Symposium on Fault-Tolerant Computing.

[3]  Milo M. K. Martin,et al.  Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset , 2005, CARN.

[4]  Jie Wu,et al.  A Fault-Tolerant and Deadlock-Free Routing Protocol in 2D Meshes Based on Odd-Even Turn Model , 2003, IEEE Trans. Computers.

[5]  Doug A. Edwards,et al.  Adaptive stochastic routing in fault-tolerant on-chip networks , 2009, 2009 3rd ACM/IEEE International Symposium on Networks-on-Chip.

[6]  Suresh Chalasani,et al.  Fault-tolerant wormhole routing in tori , 1994, ICS '94.

[7]  Scott A. Mahlke,et al.  BulletProof: a defect-tolerant CMP switch architecture , 2006, The Twelfth International Symposium on High-Performance Computer Architecture, 2006..

[8]  José Duato,et al.  An Efficient Implementation of Distributed Routing Algorithms for NoCs , 2008, Second ACM/IEEE International Symposium on Networks-on-Chip (nocs 2008).

[9]  S. Borkar,et al.  An 80-Tile Sub-100-W TeraFLOPS Processor in 65-nm CMOS , 2008, IEEE Journal of Solid-State Circuits.

[10]  T. Moon Error Correction Coding: Mathematical Methods and Algorithms , 2005 .

[11]  Sudhakar Yalamanchili,et al.  Interconnection Networks: An Engineering Approach , 2002 .

[12]  David Blaauw,et al.  Vicis: A reliable network for unreliable silicon , 2009, 2009 46th ACM/IEEE Design Automation Conference.

[13]  W. Dally,et al.  Route packets, not wires: on-chip interconnection networks , 2001, Proceedings of the 38th Design Automation Conference (IEEE Cat. No.01CH37232).

[14]  L. Benini,et al.  Xpipes: a network-on-chip architecture for gigascale systems-on-chip , 2004, IEEE Circuits and Systems Magazine.

[15]  Niraj K. Jha,et al.  GARNET: A detailed on-chip network model inside a full-system simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[16]  Pedro López,et al.  Region-Based Routing: An Efficient Routing Mechanism to Tackle Unreliable Hardware in Network on Chips , 2007, First International Symposium on Networks-on-Chip (NOCS'07).

[17]  José Duato,et al.  A theory of fault-tolerant routing in wormhole networks , 1994, Proceedings of 1994 International Conference on Parallel and Distributed Systems.

[18]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[19]  William J. Dally,et al.  A delay model and speculative architecture for pipelined routers , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.