On the Fault Tolerance of Fat-Trees

We examine the reliability properties of ideal fat-trees, a general model that captures both the distance and the bandwidth constraints of various classes of fat-tree networks. We allow the edges and the vertices of the network to fail independently with probability f, and show that: (1) Any fat-tree G can be partitioned into an upper part (G_H) and a lower part (G_L). After the faults, the surviving part of G_L guarantees that, with high probability, a linear fraction of the leaves of the fat-tree still connect to the upper part. (2) G_H is robust, in the sense that, after the faults, at least half of the edge-disjoint paths between any set of “leaves” of G_H are preserved with probability tending to 1, for every failure probability f < 0.25. This robustness holds both when fat-nodes have no internal edges and when fat-nodes are random regular graphs. (3) For the special case of a pruned butterfly, there is a critical probability p_c for the existence of a linear-sized component that survives the failures and includes a large fraction of the terminal nodes. We show that p_c ≥ 0.42.
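
The fault model in the abstract (every vertex and every edge fails independently with probability f) can be explored empirically. The following is only a minimal Monte Carlo sketch under stated assumptions, not the paper's analysis: it uses a standard butterfly rather than the pruned butterfly or an ideal fat-tree, and the function names `butterfly_edges`, `apply_faults`, and `largest_component` are hypothetical helpers introduced here for illustration.

```python
import random
from collections import deque


def butterfly_edges(k):
    """Edges of a standard k-dimensional butterfly (an assumption: this is
    not the pruned butterfly). Nodes are (level, row) with 0 <= level <= k
    and 0 <= row < 2**k."""
    edges = []
    for level in range(k):
        for row in range(2 ** k):
            # Straight edge to the same row on the next level.
            edges.append(((level, row), (level + 1, row)))
            # Cross edge to the row that differs in bit `level`.
            edges.append(((level, row), (level + 1, row ^ (1 << level))))
    return edges


def apply_faults(nodes, edges, f, rng):
    """Keep each node and each edge independently with probability 1 - f.
    An edge also disappears when either of its endpoints has failed."""
    alive = {v for v in nodes if rng.random() >= f}
    kept = [(u, v) for (u, v) in edges
            if u in alive and v in alive and rng.random() >= f]
    return alive, kept


def largest_component(alive, kept):
    """Size of the largest connected component among surviving nodes (BFS)."""
    adj = {v: [] for v in alive}
    for u, v in kept:
        adj[u].append(v)
        adj[v].append(u)
    seen = set()
    best = 0
    for start in alive:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            x = queue.popleft()
            size += 1
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    queue.append(y)
        best = max(best, size)
    return best


if __name__ == "__main__":
    k = 8
    nodes = [(level, row) for level in range(k + 1) for row in range(2 ** k)]
    edges = butterfly_edges(k)
    rng = random.Random(1)
    for f in (0.05, 0.25, 0.45):
        alive, kept = apply_faults(nodes, edges, f, rng)
        frac = largest_component(alive, kept) / len(nodes)
        print(f"f={f:.2f}: largest surviving component covers {frac:.2%} of nodes")
```

Running the script for several values of f gives a rough feel for how the largest surviving component shrinks as the failure probability grows. A single finite instance cannot confirm or refute the asymptotic bounds stated above.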
