MARS - A framework for minimizing the job execution time in a metacomputing environment

Abstract: Utilizing a collection of workstations and supercomputers in a metacomputing environment not only offers an enormous amount of computing power, but also raises new problems. The true potential of WAN-based distributed computing can only be exploited if the application-to-architecture mapping reflects the different processor speeds, the network performance, and the application's communication characteristics. In this paper, we present the Metacomputer Adaptive Runtime System (MARS), a framework for minimizing the execution time of distributed applications on a WAN metacomputer. Workload balancing and task migration are based on dynamic information about processor load and network performance. Moreover, MARS uses accumulated statistical data from previous execution runs of the same application to derive an improved task-to-process mapping. Migration decisions are based on (1) the current system load, (2) the network load, and (3) previously obtained application-specific characteristics. Our current implementation supports C applications with MPI message-passing calls, but the general framework is also applicable to other programming environments such as PVM, PARMACS and Express.
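The three inputs to a migration decision named in the abstract can be illustrated with a minimal sketch. This is not the MARS implementation: the cost model (compute time scaled by free CPU share, plus transfer time over the available bandwidth) and all names (`Host`, `TaskProfile`, `estimated_time`, `should_migrate`) are assumptions chosen only to show how system load, network load, and statistics from earlier runs might be combined.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Hypothetical view of a machine as seen by a load monitor."""
    name: str
    speed: float      # work units per second when idle
    load: float       # fraction of CPU used by other jobs, 0.0 <= load < 1.0
    bandwidth: float  # bytes per second to the task's communication partners


@dataclass
class TaskProfile:
    """Hypothetical per-task statistics gathered from previous runs."""
    work: float         # remaining compute work, in work units
    comm_bytes: float   # remaining bytes the task is expected to exchange
    state_bytes: float  # size of the task state to transfer on migration


def estimated_time(task: TaskProfile, host: Host) -> float:
    """Assumed cost model: compute time plus communication time."""
    compute = task.work / (host.speed * (1.0 - host.load))
    communicate = task.comm_bytes / host.bandwidth
    return compute + communicate


def should_migrate(task: TaskProfile, current: Host, candidate: Host,
                   link_bandwidth: float) -> bool:
    """Migrate only if the candidate host wins after paying the transfer cost."""
    migration_cost = task.state_bytes / link_bandwidth
    return estimated_time(task, candidate) + migration_cost < estimated_time(task, current)


if __name__ == "__main__":
    task = TaskProfile(work=100.0, comm_bytes=1e6, state_bytes=1e6)
    busy = Host("busy", speed=1.0, load=0.5, bandwidth=1e6)
    idle = Host("idle", speed=1.0, load=0.0, bandwidth=1e6)
    print(should_migrate(task, busy, idle, link_bandwidth=1e6))
    print(should_migrate(task, busy, idle, link_bandwidth=1e3))
```

In this sketch a lightly loaded host is attractive only when the link between the two hosts is fast enough that moving the task state does not cancel the gain, which is one plausible reason for treating network load as a separate input.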
