The Design and Implementation of a Fault-Tolerant Cluster Manager

Cluster management middleware schedules tasks on a cluster, controls access to shared resources, supports task submission and monitoring, and coordinates the cluster’s fault tolerance mechanisms. Reliable, continuous operation of the management middleware is therefore a prerequisite for reliable operation of the cluster, and the middleware should tolerate a wide class of faults with minimal interruption to management operations. This paper describes design considerations and implementation details of cluster management middleware for high-performance computing in space, where fault rates are significantly higher than for earth-bound systems. We describe key detection, recovery, and reconfiguration mechanisms for different components of the system. The system is based on centralized decision making. Unlike other systems, the decision-making capability is protected by active replication and by the ability to restore the decision maker to full operational and fault tolerance capabilities following a node failure. The management middleware also provides application tasks with an out-of-band signaling capability that can be a key building block for application-level fault tolerance mechanisms. The middleware described has been implemented as part of the UCLA Fault-Tolerant Cluster Testbed (FTCT) project. Based on measurements of this implementation, we present a preliminary evaluation of the overheads incurred by the management middleware.
