Adaptive Scheduling for Task Farming with Grid Middleware

Scheduling in metacomputing environments is an active field of research as the vision of a Computational Grid becomes more concrete. An important class of Grid applications consists of long-running parallel computations with large numbers of largely independent tasks (Monte Carlo simulations, parameter-space searches, etc.). Several Grid middleware projects can be used to implement such applications, but scheduling strategies for them remain an open research issue, mainly because of the diversity of both Grid resource types and their availability patterns. The purpose of this work is to develop and validate a general adaptive scheduling algorithm for task farming applications, along with a user interface that makes the algorithm accessible to domain scientists. The authors’ algorithm is general in that it is not tailored to a particular Grid middleware and it requires very few assumptions about the nature of the resources. Their first testbed is NetSolve, because it allows quick and easy development of the algorithm by isolating the developer from issues such as process control, I/O, remote software access, and fault tolerance.
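To make the notion of task farming over heterogeneous resources concrete, here is a minimal, generic sketch of pull-based self-scheduling: each worker takes the next task as soon as it becomes idle, so faster resources naturally receive more work. This is not the authors' algorithm, which the abstract does not describe. The function name `self_schedule`, the worker names, and the per-task times are hypothetical. The simulation assumes equal-cost independent tasks and ignores everything Grid middleware such as NetSolve would handle (remote execution, I/O, fault tolerance).

```python
import heapq
from typing import Dict, List, Tuple


def self_schedule(num_tasks: int,
                  seconds_per_task: Dict[str, int]) -> Tuple[Dict[str, List[int]], int]:
    """Simulate pull-based (self-scheduled) task farming.

    seconds_per_task maps a hypothetical worker name to the time it needs
    for one task. Every task is assumed independent and equally costly.
    Returns the task indices run by each worker and the overall makespan.
    """
    assignments: Dict[str, List[int]] = {name: [] for name in seconds_per_task}
    # Min-heap of (time at which the worker becomes idle, worker name);
    # ties are broken by name so the simulation is deterministic.
    idle_at = [(0, name) for name in sorted(seconds_per_task)]
    heapq.heapify(idle_at)

    for task in range(num_tasks):
        # The first worker to become idle pulls the next task.
        start, name = heapq.heappop(idle_at)
        assignments[name].append(task)
        heapq.heappush(idle_at, (start + seconds_per_task[name], name))

    # The makespan is the latest time at which any worker finishes.
    makespan = max(t for t, _ in idle_at)
    return assignments, makespan


if __name__ == "__main__":
    jobs, makespan = self_schedule(8, {"fast": 1, "slow": 3})
    print(jobs, makespan)
```

With one worker three times faster than the other, the fast worker runs six of the eight tasks ([0, 2, 3, 4, 6, 7]) and both workers finish at time 6. A static round-robin split of four tasks each would keep the slow worker busy until time 12. The sketch only shows that pulling work on demand adapts to heterogeneous speeds without prior knowledge of them. Coping with fluctuating resource availability, which the abstract stresses, requires more than this.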