Host Side Dynamic Reconfiguration with InfiniBand

Both rerouting around faulty components and migrating jobs require reconfiguring the data structures in the Queue Pairs that reside in the hosts of an InfiniBand cluster. In this paper we report an implementation of dynamic reconfiguration of these host-side data structures. Our implementation preserves the Queue Pairs and lets the application run without interruption. With this implementation, we demonstrate a complete solution to fault tolerance in an InfiniBand network, in which dynamic network reconfiguration to a topology-agnostic routing function is used to avoid malfunctioning components. In principle, this solution lets applications run uninterrupted on the cluster as long as the topology remains physically connected. Measurements on our test cluster show that the added setup latency of our method is negligible and that throughput is reduced only slightly during reconfiguration.
