A new approach to fault-tolerant wormhole routing for mesh-connected parallel computers

A new method for fault-tolerant wormhole routing in meshes of arbitrary dimension is introduced. The method was motivated by certain routing requirements of an initial design of the Blue Gene supercomputer at IBM Research. The machine is organized as a three-dimensional mesh containing many thousands of nodes, and the routing method should tolerate a few percent of the nodes being faulty. There has been much work on routing methods for meshes that route messages around faults or regions of faults. The new method instead declares certain nonfaulty nodes to be "lambs." A lamb is used for routing but not processing, so a lamb is neither the source nor the destination of a message. The lambs are chosen so that every "survivor node," a node that is neither faulty nor a lamb, can reach every other survivor node by at most two rounds of dimension-ordered (such as e-cube) routing. An algorithm for finding a set of lambs is presented. The results of simulations on 2D and 3D meshes of various sizes with various numbers of random node faults are given. For example, on a 32 × 32 × 32 3D mesh with 3 percent random faults and using at most two rounds of e-cube routing for each message, the average number of lambs is less than 68, which is less than 7 percent of the 983 faults and less than 0.21 percent of the 32,768 nodes.
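The survivor condition described above can be illustrated with a small brute-force check. This is only a sketch and not the lamb-finding algorithm of the paper. The function names (`ecube_path`, `reaches_one_round`, `reaches_two_rounds`, `is_valid_lamb_set`) are hypothetical. The sketch also assumes that the intermediate node between the two rounds may be any nonfaulty node, lamb or survivor, which the abstract does not specify.

```python
from itertools import product

def ecube_path(src, dst):
    """Yield the nodes visited by dimension-ordered routing from src to dst:
    fully correct coordinate 0 first, then coordinate 1, and so on."""
    cur = list(src)
    yield tuple(cur)
    for d in range(len(src)):
        step = 1 if dst[d] > cur[d] else -1
        while cur[d] != dst[d]:
            cur[d] += step
            yield tuple(cur)

def reaches_one_round(src, dst, faulty):
    """True if the single dimension-ordered path avoids every faulty node."""
    return all(node not in faulty for node in ecube_path(src, dst))

def reaches_two_rounds(src, dst, faulty, good_nodes):
    """True if some nonfaulty intermediate node links src to dst.
    Assumption: the intermediate may be a lamb. Taking mid == src
    covers the case where one round already suffices."""
    return any(
        reaches_one_round(src, mid, faulty) and reaches_one_round(mid, dst, faulty)
        for mid in good_nodes
    )

def is_valid_lamb_set(shape, faulty, lambs):
    """Check that every survivor reaches every survivor in at most two rounds."""
    good = [n for n in product(*(range(s) for s in shape)) if n not in faulty]
    survivors = [n for n in good if n not in lambs]
    return all(
        reaches_two_rounds(s, t, faulty, good)
        for s in survivors
        for t in survivors
    )

# 3 x 3 mesh in which faults at (0, 1) and (1, 0) cut off the corner (0, 0).
faults = {(0, 1), (1, 0)}
print(is_valid_lamb_set((3, 3), faults, set()))      # corner cannot reach anyone
print(is_valid_lamb_set((3, 3), faults, {(0, 0)}))   # corner sacrificed as a lamb
```

In this toy case, marking the cut-off corner as a lamb leaves six survivors that can all reach one another, whereas without the lamb the condition fails. The check is exhaustive over all pairs and intermediates, so it is only practical for very small meshes.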
