Dynamic reconfiguration in high-speed computer clusters

High-speed local and system area networks may change their topology when routers and hosts are switched on or off, or when components fail. In such cases, a reconfiguration algorithm must be executed to restore network connectivity and thus achieve high system reliability. However, most existing solutions rely either on redundant network paths or on regular network topologies. The purpose of this paper is to specify NetRec, a novel algorithm for dynamically reconfiguring an arbitrary network topology when a permanent node fault occurs. Unlike other reconfiguration algorithms, NetRec is applicable to all high-speed computer networks and is compatible with all modern routing techniques, including those used in wormhole-based system area networks. It restores network connectivity by building a tree that spans all immediate neighbors of the faulty node that are still connected to the network. The algorithm is distributed and does not require any global knowledge.
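The idea of a tree spanning the faulty node's surviving neighbors can be illustrated with a small sketch. This is not NetRec itself: NetRec is distributed and needs no global knowledge, whereas the sketch below is centralized and assumes the whole topology is available as an adjacency map. The function name `restoration_tree`, the choice of the first neighbor as root, and the use of breadth-first search are all assumptions made for illustration only.

```python
from collections import deque


def restoration_tree(adjacency, faulty):
    """Return a set of (parent, child) edges forming a tree that connects
    the neighbors of `faulty` through the remaining network.

    Hypothetical, centralized illustration; not the distributed NetRec
    algorithm described in the abstract.
    """
    neighbors = sorted(adjacency[faulty])
    if not neighbors:
        return set()

    # Assumption: the first neighbor (in sorted order) acts as the root.
    root = neighbors[0]

    # Breadth-first search from the root that never enters the faulty node.
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in sorted(adjacency[node]):
            if nxt != faulty and nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)

    # Keep only the paths that lead from each reachable neighbor to the root.
    edges = set()
    for n in neighbors:
        if n not in parent:
            continue  # this neighbor is cut off from the root's component
        while parent[n] is not None and (parent[n], n) not in edges:
            edges.add((parent[n], n))
            n = parent[n]
    return edges


# Example topology: F is the faulty node, with neighbors A, C and E.
# E has no other link, so it is no longer connected once F fails.
example = {
    "A": {"F", "B"},
    "B": {"A", "C"},
    "C": {"B", "F", "D"},
    "D": {"C"},
    "E": {"F"},
    "F": {"A", "C", "E"},
}
tree = restoration_tree(example, "F")
```

In this example the tree consists of the edges A-B and B-C, which reconnect the surviving neighbors A and C. Node D is left out because it lies on no path between neighbors, and E is skipped because it is isolated by the fault. A simplification to keep in mind is that neighbors outside the root's component are simply ignored here.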
