GATOSTAR: A Fault Tolerant Load Sharing Facility for Parallel Applications

This paper explains why and how to unify load sharing and fault tolerance facilities. It presents and discusses GatoStar, an implementation of a fault tolerant load sharing facility. GatoStar integrates two applications developed on top of Unix: Gatos and Star. Gatos is a load sharing manager that automatically distributes parallel applications among heterogeneous hosts according to multicriteria allocation algorithms. Star is a software fault tolerance manager that automatically recovers the processes of faulty machines using checkpointing and message logging. The main advantage of this approach is that fault tolerance performs better when processes are allocated or recovered according to the load sharing policies. The unification not only improves the efficiency of both facilities but also avoids many redundant mechanisms between them. Indeed, each facility needs to manage at least three common features: global knowledge of the running processors, a crash detection mechanism, and remote process management. The backbone of the unification is a logical communication ring used both for host crash detection and for transferring host-related information. All necessary information is thus acquired at a relatively low message cost compared with running the two systems independently.
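The abstract does not describe the ring protocol itself, so the following Python sketch is only an illustration of the general idea, under stated assumptions: a single token travels once around a logical ring, collects each live host's load, and notes hosts that fail to respond. The names `Host`, `LogicalRing`, `circulate`, and `least_loaded` are hypothetical and are not taken from GatoStar, and a lookup of the `alive` flag stands in for a real timeout on a network send.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Hypothetical host record; 'load' is an abstract load index."""
    name: str
    load: float
    alive: bool = True


class LogicalRing:
    """Toy model of a logical ring (an assumption, not GatoStar's protocol)."""

    def __init__(self, hosts):
        self.hosts = list(hosts)

    def circulate(self):
        """Pass a token once around the ring.

        Returns (load_table, crashed, messages): the loads of live hosts,
        the names of hosts that did not respond, and the number of
        forwarding attempts made during this pass.
        """
        load_table = {}
        crashed = []
        survivors = []
        messages = 0
        for host in self.hosts:
            messages += 1  # one forwarding attempt per ring member
            if host.alive:
                load_table[host.name] = host.load
                survivors.append(host)
            else:
                # A real system would detect this through a timeout.
                crashed.append(host.name)
        # Reconfigure the ring so that later passes skip crashed hosts.
        self.hosts = survivors
        return load_table, crashed, messages


def least_loaded(load_table):
    """Pick the host with the smallest load, e.g. as a recovery target."""
    return min(load_table, key=load_table.get)
```

In this sketch one pass serves both purposes: the same messages that detect a crashed host also carry the load information that a load sharing policy could use to choose where to restart that host's processes. Two separate facilities would each need their own pass to obtain the same information.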
