Traditionally, PVM and MPI programs run on message-passing systems, ranging from clusters of non-dedicated workstations to MPP machines. The performance of a parallel program in such an environment is usually determined by its single slowest task. In a homogeneous, stable environment, such as an MPP machine, this can only be remedied by improving the workload balance between the individual tasks. In a cluster of workstations, differences in the performance of individual nodes and network components can be an important cause of imbalance. Moreover, these differences are time dependent, because the load generated by other users plays an important role. Worse yet, nodes may be dynamically removed from the available pool of workstations. In such a dynamically changing environment, redistributing tasks over the available nodes can help to maintain the performance of individual programs and of the pool as a whole. Condor [1] solves this task migration problem for sequential programs. However, the migration of tasks in a parallel program presents a number of additional challenges, for the migrator as well as for the scheduler. For PVM programs, there are a number of solutions, including Dynamite [2]; Hector [3] was designed to migrate MPI tasks and to checkpoint complete MPI programs. The latter capability is very desirable for long-running programs in an unreliable environment. This brings us to the Grid, where both performance and availability of resources vary dynamically and where reliability is an important issue. Once again, Livny, with Condor-G [4], provides a solution for sequential programs, including provisions for fault tolerance. In the Polder Metacomputer Project, based on our experience with Dynamite, we are currently investigating the additional challenges in creating a task-migration and checkpointing capability for the Grid environment.
These challenges include the handling of shared resources, such as open files, and differences between administrative domains, among others. Eventually, the migration of parallel programs will allow large parallel applications to surf the Grid and ride the waves in this highly dynamic environment.
[1] Miron Livny, et al. Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System, 1997.
[2] Peter M. A. Sloot, et al. The implementation of Dynamite: an environment for migrating PVM tasks, 2000, OPSR.
[3] Jonathan Robinson, et al. A task migration implementation of the Message-Passing Interface, 1996, Proceedings of 5th IEEE International Symposium on High Performance Distributed Computing.
[4] Ian T. Foster, et al. Condor-G: A Computation Management Agent for Multi-Institutional Grids, 2004, Cluster Computing.