Topology Agnostic Dynamic Quick Reconfiguration for Large-Scale Interconnection Networks

Tolerating faults in the interconnection network is of vital importance in today's huge computer installations. Still, existing solutions fall short of being satisfactory. They require the system to fall back on a routing algorithm that is inferior to the original, either in performance, in the number of virtual channels it needs, or both. Furthermore, since current hardware does not support dynamic reconfiguration, existing methods require the system to be halted while reconfiguration takes place in order to avoid deadlocks. In this paper we present a method that efficiently generates a new routing function in the presence of faults. The new routing function reroutes only the traffic that is affected by the fault, so that the performance of the original routing function is preserved to the extent possible. No specific functionality is required in the switches, and the method needs exactly the same number of virtual channels in the presence of faults as the original routing algorithm did. Finally, the new routing function is compatible with the old one, so that a deadlock-free dynamic transition between the old and the new routing function is immediately available. This means that our solution can easily be implemented on current InfiniBand platforms, e.g. through the OFED software stack. We demonstrate that the method is workable for meshes, tori and fat-trees, and that it is able to guarantee one-fault tolerance for all of these topologies.
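The idea of rerouting only the traffic affected by a fault can be illustrated with a minimal sketch. This is not the method of the paper: it assumes a hypothetical routing table that maps each (source, destination) pair to an explicit path, replaces only the paths that cross the failed link with a breadth-first shortest path, and ignores virtual channels and deadlock freedom entirely. All names (`uses_link`, `shortest_path`, `reroute_affected`) are illustrative.

```python
from collections import deque


def uses_link(path, link):
    """Return True if the undirected link (a, b) appears on the path."""
    a, b = link
    return any({u, v} == {a, b} for u, v in zip(path, path[1:]))


def shortest_path(adj, src, dst, failed):
    """Breadth-first search from src to dst that never crosses the failed link.

    Returns a list of nodes, or None if dst is unreachable.
    """
    a, b = failed
    prev = {src: None}  # predecessor map, doubles as the visited set
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in adj[node]:
            if {node, nxt} == {a, b} or nxt in prev:
                continue
            prev[nxt] = node
            queue.append(nxt)
    return None


def reroute_affected(adj, routes, failed):
    """Keep every unaffected path as-is; recompute only paths using the failed link.

    NOTE: illustrative only. It makes no attempt to keep the new routes
    deadlock free or compatible with the old ones, which the paper's
    method is designed to guarantee.
    """
    new_routes = {}
    for (src, dst), path in routes.items():
        if uses_link(path, failed):
            new_routes[(src, dst)] = shortest_path(adj, src, dst, failed)
        else:
            new_routes[(src, dst)] = path
    return new_routes


# Assumed example: a 2x2 mesh with nodes 0..3 and links 0-1, 0-2, 1-3, 2-3.
mesh = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
old_routes = {(0, 3): [0, 1, 3], (2, 3): [2, 3], (0, 1): [0, 1]}
new_routes = reroute_affected(mesh, old_routes, failed=(0, 1))
```

In this example only the two routes that crossed link 0-1 change; the route from 2 to 3 is left untouched, mirroring the goal of preserving the original routing function wherever the fault does not interfere.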
