Hypercube network fault tolerance: a probabilistic approach

Extensive experience has shown that hypercube networks are highly fault tolerant. Frustratingly, it has proved very difficult to formulate this important fact properly and to prove it formally, despite extensive research efforts over the past two decades. Most proposed fault tolerance models for hypercube networks characterize only very rare extreme situations and thus significantly underestimate the fault tolerance of hypercube networks, while for more realistic fault tolerance models the analysis becomes much more complicated. We develop new techniques to analyze a realistic fault tolerance model and derive lower bounds for the probability of hypercube network fault tolerance. Our results are both theoretically significant and practically important. Theoretically, our method offers general and powerful techniques for formally proving lower bounds on the probability of network connectivity. Practically, our results give manufacturers formally proven, precisely stated upper bounds on node failure probabilities that suffice to achieve a desired probability of network connectivity. Our techniques are also useful for analyzing the performance of routing algorithms.
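To make the quantity under study concrete, the following is a minimal Monte Carlo sketch, not the analytical technique of the paper. It assumes a model the abstract does not spell out: each node of an n-dimensional hypercube fails independently with probability p, and the network counts as connected when the non-faulty nodes induce a connected subgraph. The function names, the parameters, and the convention that zero or one surviving node counts as connected are all illustrative assumptions.

```python
import random


def is_connected(n, alive):
    """Return True if the non-faulty nodes of the n-cube induce a connected subgraph.

    `alive` is a list of 2**n booleans indexed by node label.
    Assumed convention: zero or one surviving node counts as connected.
    """
    size = 1 << n
    survivors = sum(alive)
    if survivors <= 1:
        return True
    start = next(v for v in range(size) if alive[v])
    seen = {start}
    stack = [start]
    while stack:
        v = stack.pop()
        # Neighbors in a hypercube differ from v in exactly one bit.
        for i in range(n):
            w = v ^ (1 << i)
            if alive[w] and w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == survivors


def estimate_connectivity(n, p, trials, seed=0):
    """Estimate the probability of connectivity under independent node failures.

    Assumption: every node fails independently with the same probability p.
    """
    rng = random.Random(seed)
    size = 1 << n
    connected = 0
    for _ in range(trials):
        alive = [rng.random() >= p for _ in range(size)]
        if is_connected(n, alive):
            connected += 1
    return connected / trials


if __name__ == "__main__":
    # Illustrative parameters only; they are not taken from the paper.
    for p in (0.01, 0.05, 0.10):
        print(p, estimate_connectivity(8, p, trials=2000))
```

A simulation of this kind yields only an empirical estimate for one network size and one failure rate at a time. A formally proven lower bound, as described in the abstract, holds without sampling error.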
