Fault-Tolerant Adaptive and Minimal Routing in Mesh-Connected Multicomputers Using Extended Safety Levels

The minimal routing problem in mesh-connected multicomputers with faulty blocks is studied. Two-dimensional meshes are used to illustrate the approach. A sufficient condition for minimal routing in 2D meshes with faulty blocks is proposed. Unlike many traditional models, which assume that every node knows the global fault distribution, our approach is based on the concept of an extended safety level, a special form of limited fault information. The extended safety level information is captured by a vector associated with each node. When the extended safety level of a node reaches a certain value (or meets certain conditions), a minimal path exists from that node to any nonfaulty node in the 2D mesh. Specifically, we study the existence of minimal paths at a given source node, the limited distribution of fault information, and minimal routing itself. We propose three fault-tolerant minimal routing algorithms that are adaptive, allowing each message to use any minimal path. We also provide some general ideas for extending our approach to other low-dimensional mesh-connected multicomputers, such as 2D tori and 3D meshes. Our approach is the first attempt to address adaptive and minimal routing in 2D meshes with faulty blocks using limited fault information.
