Load balancing and fault tolerance in workstation clusters: migrating groups of communicating processes

Several process migration facilities for distributed systems have been developed in the past. Because the subject is complex, each of these facilities has limitations that restrict it to narrow classes of applications and environments. We discuss some of the common limitations and possible solutions. Specifically, we focus on migrating groups of collaborating processes between Unix systems without kernel modifications, and from this we derive the design of a migration system. Initial experience with our implementation shows that its migration performance is close to that of real distributed operating systems.
