Fault-tolerant Routing for Multiple Permanent and Non-permanent Faults in HPC Systems

The interconnection network links the processing units of modern high-performance computing systems and carries all communication between them. In this context, network faults have an extremely high impact, since most routing algorithms were not designed to tolerate them. As a result, even a single fault may stall messages in the network, preventing applications from completing, or may lead to deadlocked configurations. In this paper we introduce a fault-tolerant routing method designed to tolerate a large number of dynamic permanent and non-permanent link faults. As failures appear randomly during system operation, our method provides escape paths for stalled messages and, at the same time, avoids deadlock. Our proposal avoids faulty areas by means of multipath routing, taking advantage of communication path redundancy as long as alternative paths are available. The performance evaluation consists of synthetic test scenarios for proving correctness and test scenarios based on the availability traces of real high-performance systems. Experiments show that our method allows applications to complete their executions successfully even in the presence of a large number of faults, with performance degradation below 3% for a 1024-node system with up to 200 simultaneous link failures.
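
To make the idea of exploiting path redundancy concrete, the following is a minimal sketch, not the method proposed in the paper. It assumes a 2D mesh topology and uses a plain breadth-first search to find a shortest path that skips failed links. The names `mesh_neighbors` and `route_around_faults`, and the mesh model itself, are hypothetical and chosen only for illustration. The sketch does not address deadlock avoidance.

```python
from collections import deque


def mesh_neighbors(node, width, height):
    """Yield the nodes adjacent to `node` in a width x height 2D mesh (assumed topology)."""
    x, y = node
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < width and 0 <= ny < height:
            yield (nx, ny)


def route_around_faults(src, dst, width, height, failed_links):
    """Return a shortest fault-free path from src to dst, or None if none exists.

    `failed_links` is an iterable of (node_a, node_b) pairs; links are treated
    as bidirectional, so a failure blocks both directions.
    """
    failed = {frozenset(link) for link in failed_links}
    parent = {src: None}  # also serves as the visited set
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            # Walk the parent chain back to the source to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in mesh_neighbors(node, width, height):
            if nxt not in parent and frozenset((node, nxt)) not in failed:
                parent[nxt] = node
                queue.append(nxt)
    return None  # destination unreachable: no alternative path is available
```

Under these assumptions, a message in a 3x3 mesh travelling from `(0, 0)` to `(2, 0)` takes a longer detour when the link between `(0, 0)` and `(1, 0)` has failed, and the function returns `None` once the failures disconnect the two nodes. This mirrors the abstract's condition that faults can be avoided only as long as alternative paths are available.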
