Cost-Effective Software Based Fault-Tolerant Routing in Pipelined Networks

This paper presents a software-based approach to fault-tolerant routing in networks that use wormhole or virtual cut-through switching. When a message encounters a faulty output link, the local router removes it from the network and delivers it to the messaging layer of the local node's operating system. The message-passing software can then re-route the message, possibly along a non-minimal path. Alternatively, the message may be addressed to an intermediate node, which forwards it to the destination. A message may encounter multiple faults and pass through multiple intermediate nodes. The proposed techniques apply to both obliviously and adaptively routed networks. They are targeted specifically at commercial multiprocessors, where the mean time to repair (MTTR) is much smaller than the mean time between router failures (MTBF), so it is sufficient to tolerate at most two or three failures. The paper presents requirements for buffer management, deadlock freedom, and livelock freedom. Simulation results evaluate the degradation in latency and throughput as a function of the number and distribution of faults. The approach has several advantages. Router designs are minimally affected and therefore remain compact and fast. Only messages that encounter faulty components are affected, and the machine is assured of continued operation until the faulty components can be replaced. The technique leverages existing network technology and is a good candidate for incorporation into the next generation of multiprocessor networks.
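The absorb-and-forward idea described above can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes a 2D mesh with dimension-order (XY) routing in hardware, directed link faults, and a messaging layer that knows the full fault set; the names `xy_path`, `pick_intermediate`, and `send` are hypothetical. It also handles only a single detour through one intermediate node, and it does not model buffers, deadlock, or livelock.

```python
from itertools import product
from typing import List, Optional, Set, Tuple

Node = Tuple[int, int]
Link = Tuple[Node, Node]  # directed link (from_node, to_node)


def xy_path(src: Node, dst: Node, faulty: Set[Link]) -> Tuple[List[Node], bool]:
    """Hardware routing (assumed XY order): correct X first, then Y.

    Returns the nodes visited and whether the message was delivered.
    On a faulty output link the message stops at the current node,
    which models the router handing it to the local messaging layer.
    """
    path, cur = [src], src
    while cur != dst:
        x, y = cur
        if x != dst[0]:
            nxt = (x + (1 if dst[0] > x else -1), y)
        else:
            nxt = (x, y + (1 if dst[1] > y else -1))
        if (cur, nxt) in faulty:
            return path, False  # absorbed at cur
        path.append(nxt)
        cur = nxt
    return path, True


def pick_intermediate(cur: Node, dst: Node, faulty: Set[Link],
                      size: int) -> Optional[Node]:
    """Software layer (hypothetical policy): choose the intermediate node
    that gives the fewest total hops with both legs free of known faults."""
    best, best_hops = None, None
    for cand in product(range(size), repeat=2):
        if cand in (cur, dst):
            continue
        leg1, ok1 = xy_path(cur, cand, faulty)
        leg2, ok2 = xy_path(cand, dst, faulty)
        if ok1 and ok2:
            hops = (len(leg1) - 1) + (len(leg2) - 1)
            if best is None or hops < best_hops:
                best, best_hops = cand, hops
    return best


def send(src: Node, dst: Node, faulty: Set[Link],
         size: int) -> Optional[List[Node]]:
    """Route a message; on a fault, re-inject it via one intermediate node.

    Returns the full node trace, or None if no single detour exists.
    """
    path, delivered = xy_path(src, dst, faulty)
    if delivered:
        return path
    absorbed_at = path[-1]
    mid = pick_intermediate(absorbed_at, dst, faulty, size)
    if mid is None:
        return None
    leg1, _ = xy_path(absorbed_at, mid, faulty)  # checked fault-free above
    leg2, _ = xy_path(mid, dst, faulty)
    return path + leg1[1:] + leg2[1:]


if __name__ == "__main__":
    faults = {((1, 0), (2, 0))}
    print(send((0, 0), (3, 0), faults, size=4))
```

With the single faulty link from (1, 0) to (2, 0) on a 4x4 mesh, the message is absorbed at (1, 0), addressed to the intermediate node (1, 1), and then forwarded along row 1 to (3, 0), a non-minimal path of five hops instead of three. Messages that never meet the fault follow the unmodified XY route, which mirrors the claim that only messages encountering faulty components are affected.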
