Fault Tolerance in Hypercubes

This paper describes schemes for tolerating faults in hypercube multiprocessors. A study of hypercube algorithms reveals that, in many cases, computations requiring local communication are mapped onto topologies such as meshes or rings, while the hypercube topology itself is used for global data communication. A faulty hypercube must therefore be reconfigured so that it can still perform both the local and the global communication required by the algorithm, with minimal performance degradation. Two general approaches can be identified. The first uses the healthy processors and links of a hypercube that has faulty nodes or links to embed topologies such as lower-dimensional hypercubes, rings, meshes, and trees for communication. The second relies on hardware redundancy in the form of spare nodes and/or links, and usually requires modifications to the communication hardware. Augmented hypercubes and spare allocation schemes are described.
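
As a rough illustration of the first approach, the sketch below finds a largest fault-free subcube of an n-dimensional hypercube and embeds a ring in it using a reflected Gray code. It is not the method of this paper or of any cited work: the function names (`find_fault_free_subcube`, `ring_in_subcube`, and the helpers) are hypothetical, only node faults are modeled, and the exhaustive subcube search is exponential in n, so it suits small examples only.

```python
from itertools import combinations, product


def gray_code(d):
    """Reflected Gray code on d bits: consecutive values differ in one bit."""
    return [i ^ (i >> 1) for i in range(1 << d)]


def expand(index, free_dims, base):
    """Place the bits of `index` on the free dimensions of the subcube at `base`."""
    addr = base
    for j, bit in enumerate(free_dims):
        if (index >> j) & 1:
            addr |= 1 << bit
    return addr


def subcube_nodes(free_dims, base):
    """All node addresses of the subcube spanned by `free_dims` at `base`."""
    return [expand(k, free_dims, base) for k in range(1 << len(free_dims))]


def find_fault_free_subcube(n, faulty):
    """Exhaustively search for a largest subcube with no faulty node.

    Returns (free_dims, base), or None if every node is faulty.
    Assumption: faults are node faults given as integer addresses.
    """
    faulty = set(faulty)
    for d in range(n, -1, -1):                    # try the largest dimension first
        for free_dims in combinations(range(n), d):
            fixed = [b for b in range(n) if b not in free_dims]
            for values in product((0, 1), repeat=len(fixed)):
                base = sum(v << b for v, b in zip(values, fixed))
                if faulty.isdisjoint(subcube_nodes(free_dims, base)):
                    return free_dims, base
    return None


def ring_in_subcube(free_dims, base):
    """Ring embedding: visit the subcube's nodes in Gray-code order.

    Neighbors on the ring (including the wrap-around pair) are hypercube
    neighbors when the subcube has dimension at least 1.
    """
    return [expand(g, free_dims, base) for g in gray_code(len(free_dims))]


if __name__ == "__main__":
    # Hypothetical example: a 4-cube (16 nodes) with node 5 faulty.
    free_dims, base = find_fault_free_subcube(4, {5})
    print(ring_in_subcube(free_dims, base))
```

Restricting the computation to a fault-free subcube is simple but can waste healthy nodes: a single fault halves the usable machine. That cost is what motivates the more refined embeddings and the spare-based schemes.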
