Dynamic reconfiguration in computer clusters with irregular topologies in the presence of multiple node and link failures

Component failures in high-speed computer networks can result in significant topological changes. In such cases, a network reconfiguration algorithm must be executed to restore connectivity between the network nodes. Most contemporary networks either use static reconfiguration algorithms or stop user traffic in order to prevent cyclic dependencies in the routing tables. This paper presents NetRec, a dynamic network reconfiguration algorithm for tolerating multiple node and link failures in high-speed networks with arbitrary topology. The algorithm updates the routing tables asynchronously and does not require any global knowledge of the network topology. Certain phases of NetRec are executed in parallel, which reduces the reconfiguration time. The algorithm suspends application traffic only in small regions of the network, and only while the routing tables there are being updated. The message complexity of NetRec is analyzed, and the termination, liveness, and safety of the algorithm are proven. Additionally, results are presented from validating the algorithm in Distant, a distributed network-validation testbed based on the MPI 1.2 features for building arbitrary virtual topologies.
