Dome: Parallel Programming in a Heteroge-neous Multi-User Environment

Writing parallel programs for distributed multi-user computing environments is a difficult task. The Distributed object migration environment (Dome) addresses three major issues of parallel computing in an architecture independent manner: ease of programming, dynamic load balancing, and fault tolerance. Dome programmers, with modest effort, can write parallel programs that are automatically distributed over a heterogeneous network, dynamically load balanced as the program runs, and able to survive compute node and network failures. This paper provides the motivation for and an overview of Dome, including a preliminary performance evaluation of dynamic load balancing for distributed vectors. Dome programs are shorter and easier to write than the equivalent programs written with message passing primitives. The performance overhead of Dome is characterized, and it is shown that this overhead can be recouped by dynamic load balancing in imbalanced systems. Finally, we show that a parallel program can be made failure resilient through Dome''s architecture independent checkpoint and restart mechanisms.

[1]  James H. Patterson,et al.  Portable Programs for Parallel Processors , 1987 .

[2]  Luís Moura Silva,et al.  Global checkpointing for distributed programs , 1992, [1992] Proceedings 11th Symposium on Reliable Distributed Systems.

[3]  Shahid H. Bokhari,et al.  Assignment Problems in Parallel and Distributed Computing , 1987 .

[4]  Robert H. Halstead,et al.  MULTILISP: a language for concurrent symbolic computation , 1985, TOPL.

[5]  Peter Steenkiste,et al.  Fail-Safe PVM: A Portable Package for Distributed Programming with Transparent Recovery , 1993 .

[6]  Jeffrey F. Naughton,et al.  Checkpointing multicomputer applications , 1991, [1991] Proceedings Tenth Symposium on Reliable Distributed Systems.

[7]  Jack Dongarra,et al.  Pvm 3 user's guide and reference manual , 1993 .

[8]  Willy Zwaenepoel,et al.  The performance of consistent checkpointing , 1992, [1992] Proceedings 11th Symposium on Reliable Distributed Systems.

[9]  Andrzej Duda,et al.  The Effects of Checkpointing on Program Execution Time , 1983, Inf. Process. Lett..

[10]  Jack Dongarra,et al.  An object oriented design for high performance linear algebra on distributed memory architectures , 1993 .

[11]  Edward D. Lazowska,et al.  Adaptive load sharing in homogeneous distributed systems , 1986, IEEE Transactions on Software Engineering.

[12]  Darrell D. E. Long,et al.  A study of the reliability of Internet sites , 1991, [1991] Proceedings Tenth Symposium on Reliable Distributed Systems.

[13]  Ronald J. Vetter,et al.  ATM concepts, architectures, and protocols , 1995, CACM.

[14]  Erik Seligman,et al.  High-Level Fault Tolerance in Distributed Programs , 1994 .

[15]  James M. Purtilo,et al.  Dynamic reconfiguration in distributed systems: adapting software modules for replacement , 1993, [1993] Proceedings. The 13th International Conference on Distributed Computing Systems.

[16]  Jeffrey F. Naughton,et al.  Real-time, concurrent checkpoint for parallel programs , 1990, PPOPP '90.

[17]  A. Malony,et al.  Implementing a parallel C++ runtime system for scalable parallel systems , 1993, Supercomputing '93.

[18]  Nicholas Carriero,et al.  How to write parallel programs: a guide to the perplexed , 1989, CSUR.

[19]  Jack J. Dongarra,et al.  Visualization and debugging in a heterogeneous environment , 1993, Computer.

[20]  Kai Li,et al.  ickp: a consistent checkpointer for multicomputers , 1994, IEEE Parallel & Distributed Technology: Systems & Applications.

[21]  Jack Dongarra,et al.  PVM: Parallel virtual machine: a users' guide and tutorial for networked parallel computing , 1995 .

[22]  M. Moura Silva,et al.  Checkpointing SPMD applications on transputer networks , 1994, Proceedings of IEEE Scalable High Performance Computing Conference.

[23]  Joel H. Saltz,et al.  Dynamic Remapping of Parallel Computations with Varying Resource Demands , 1988, IEEE Trans. Computers.

[24]  Jenq Kuen Lee,et al.  Object oriented parallel programming: experiments and results , 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).