A low cost fault tolerant packet routing for parallel computers

This paper presents a new switching mechanism that tolerates arbitrary faults in interconnection networks at negligible implementation cost. Although our routing technique can be applied to any regular or irregular topology, in this paper we focus on its application to k-ary n-cube networks under both synthetic and real traffic workloads. Our mechanism is effective regardless of the number of faults and their configuration. When the network is fault-free, no overhead is added to the original routing scheme. In the presence of a small number of faults, the network sustains performance close to that observed under fault-free conditions. Finally, as the number of faults increases, the system exhibits graceful performance degradation.
