Probabilistic analysis on mesh network fault tolerance

Mesh networks are among the most important interconnection network topologies for large multicomputer systems. Under worst-case analysis, mesh networks tolerate faults poorly. Such worst cases, however, occur very rarely in realistic situations. In this paper, we study the fault tolerance of 2-D and 3-D mesh networks under a more realistic model in which each network node has an independent failure probability. We first observe that if the node failure probability is fixed, then the connectivity probability of these mesh networks can be arbitrarily small when the network size is sufficiently large. Thus, it is practically important for multicomputer system manufacturers to determine an upper bound on the node failure probability for a given network size and a required probability of network connectivity. We develop a novel technique to formally derive lower bounds on the connectivity probability for 2-D and 3-D mesh networks. Our study shows that mesh networks of practical size can tolerate a large number of faulty nodes and are thus reliable enough for multicomputer systems. For example, we formally prove that as long as the node failure probability is bounded by 0.5%, a 3-D mesh network of up to a million nodes remains connected with probability larger than 99%.
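
The probabilistic model described above can be explored empirically with a small Monte Carlo simulation. The following is only an illustrative sketch, not the formal technique developed in the paper: it assumes that "connected" means the non-faulty nodes form a single connected component (the paper's exact definition may differ), and the function names `mesh_is_connected` and `estimate_connectivity` are hypothetical.

```python
import itertools
import random
from collections import deque


def mesh_is_connected(dims, faulty):
    """Return True if the non-faulty nodes of a mesh form one component.

    dims   -- side lengths, e.g. (8, 8) for 2-D or (8, 8, 8) for 3-D
    faulty -- set of coordinate tuples marking failed nodes
    Assumption: a mesh with no surviving node counts as disconnected.
    """
    alive = set(itertools.product(*(range(d) for d in dims))) - set(faulty)
    if not alive:
        return False
    start = next(iter(alive))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        # Mesh neighbors differ by +/-1 along exactly one axis (no wraparound).
        for axis in range(len(dims)):
            for step in (-1, 1):
                nb = node[:axis] + (node[axis] + step,) + node[axis + 1:]
                if nb in alive and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
    return len(seen) == len(alive)


def estimate_connectivity(dims, p, trials=200, seed=0):
    """Estimate the connectivity probability when each node fails
    independently with probability p."""
    rng = random.Random(seed)
    nodes = list(itertools.product(*(range(d) for d in dims)))
    hits = 0
    for _ in range(trials):
        faulty = {v for v in nodes if rng.random() < p}
        if mesh_is_connected(dims, faulty):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # Small meshes only; a million-node mesh is far too slow for this sketch.
    for side in (4, 8, 12):
        print(side, estimate_connectivity((side, side, side), 0.005))
```

Such an estimate only samples the connectivity probability for one mesh size and one failure probability at a time; it cannot replace a proven lower bound of the kind the paper derives.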
