Design and validation of portable communication infrastructure for fault-tolerant cluster middleware

We describe the communication infrastructure (CI) for our fault-tolerant cluster middleware. The CI is optimized for two classes of communication: communication for the applications and communication for the cluster management middleware. It was designed for portability and for efficient operation on top of modern user-level message passing mechanisms. We present a functional fault model for the CI and show how platform-specific faults map to this model. Based on the fault model, we have developed a fault injection scheme that is integrated with the CI and is therefore portable across different communication technologies. We have used fault injection to validate and evaluate the implementation of both the CI itself and the cluster management middleware in the presence of communication faults.
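
To illustrate how a fault injector integrated with a communication layer can stay independent of the underlying transport, the following is a minimal sketch in Python. It is not the CI's actual interface: the fault classes (drop, duplicate, corrupt), the `FaultInjector` class, and the `inject_send` method are all hypothetical names chosen for illustration, and the abstract does not say which faults the functional fault model contains. The idea shown is only that faults are applied above a transport-specific `send` callable, so the same injector can wrap any transport.

```python
import random
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict


class Fault(Enum):
    """Assumed functional fault classes; the real fault model may differ."""
    NONE = "none"
    DROP = "drop"
    DUPLICATE = "duplicate"
    CORRUPT = "corrupt"


@dataclass
class FaultInjector:
    """Hypothetical injector that wraps any transport-specific send function."""
    send: Callable[[bytes], None]      # transport-specific send (assumption)
    fault_rates: Dict[Fault, float]    # probability of each fault per message
    rng: random.Random = field(default_factory=random.Random)

    def choose_fault(self) -> Fault:
        # Pick a fault by walking the cumulative probability distribution.
        r = self.rng.random()
        cumulative = 0.0
        for fault, rate in self.fault_rates.items():
            cumulative += rate
            if r < cumulative:
                return fault
        return Fault.NONE

    def inject_send(self, payload: bytes) -> Fault:
        """Send payload through the transport, possibly applying one fault."""
        fault = self.choose_fault()
        if fault is Fault.DROP:
            pass  # the message is silently lost
        elif fault is Fault.DUPLICATE:
            self.send(payload)
            self.send(payload)
        elif fault is Fault.CORRUPT and payload:
            # Flip a single randomly chosen bit in the payload.
            damaged = bytearray(payload)
            index = self.rng.randrange(len(damaged))
            damaged[index] ^= 1 << self.rng.randrange(8)
            self.send(bytes(damaged))
        else:
            self.send(payload)
        return fault
```

Because the injector only sees a `send` callable, a test harness could pass in a function backed by any messaging layer, or simply `list.append` to capture the delivered messages in memory.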