FT-DRB: A Method for Tolerating Dynamic Faults in High-Speed Interconnection Networks

The intensive and continuous use of high-performance computing systems for executing computationally intensive applications, coupled with the large number of elements that make them up, dramatically increases the likelihood of failures during their operation. The interconnection network is a critical part of such systems, and network faults have an extremely high impact because most routing algorithms are not designed to tolerate them. With such algorithms, a single fault may stall messages in the network, preventing applications from completing, or may lead to deadlocked configurations. This paper introduces a novel fault-tolerant routing method with a new deadlock avoidance technique, designed to tolerate an unbounded number of faults appearing at random during system operation. Our method provides escape paths for stalled messages. In addition, the routing algorithm configures alternative paths that avoid the faulty areas, exploiting communication path redundancy by means of multipath routing. Deadlock avoidance is achieved by adding a small queue and applying a simple set of actions when accessing output buffers with limited free space. Experiments show that our method allows applications to complete their execution in the presence of multiple faults while retaining, on average, 96% of the performance of the fault-free scenarios.
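As a rough illustration of the idea of configuring alternative paths around faulty areas, the sketch below searches a 2D mesh for a shortest path that skips links marked as faulty. It is not the FT-DRB algorithm itself: the mesh topology, the breadth-first search, and the names `mesh_neighbors`, `find_path`, and `faulty_links` are all assumptions introduced here for illustration only.

```python
from collections import deque


def mesh_neighbors(node, width, height):
    """Yield the neighbors of `node` in a width x height 2D mesh (hypothetical topology)."""
    x, y = node
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < width and 0 <= ny < height:
            yield (nx, ny)


def find_path(src, dst, width, height, faulty_links):
    """Return a shortest src-to-dst path that avoids faulty links, or None.

    `faulty_links` is a set of frozensets, each holding the two endpoints
    of a failed bidirectional link (an assumed representation).
    """
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            # Walk the parent chain back to the source to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in mesh_neighbors(node, width, height):
            if nxt in parent or frozenset((node, nxt)) in faulty_links:
                continue  # already visited, or the link has failed
            parent[nxt] = node
            queue.append(nxt)
    return None  # the faults disconnect src from dst


# Usage: a 3x3 mesh in which the link between (0, 0) and (1, 0) has failed.
faults = {frozenset(((0, 0), (1, 0)))}
detour = find_path((0, 0), (2, 0), 3, 3, faults)
```

In a fault-free 3x3 mesh the search returns the direct two-hop route along the bottom row; with the failed link it returns a four-hop detour through the adjacent row. A real router would take such decisions in a distributed fashion and combine them with a deadlock avoidance mechanism, which this sketch does not model.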
