Fault‐tolerant execution of large parameter sweep applications across multiple VOs with storage constraints

Applications that span multiple virtual organizations (VOs) are of great interest to the e‐science community. However, our recent attempts to execute large‐scale parameter sweep applications (PSAs) for real‐world climate studies with the Nimrod/G tool have exposed problems in fault tolerance, data storage and trust management. In response, we have implemented a task‐splitting approach that breaks large PSAs into a sequence of dependent subtasks, improving fault tolerance; adds a garbage collection technique that deletes data that is no longer needed; and employs a trust delegation technique that enables flexible third‐party data transfers across different VOs. Copyright © 2008 John Wiley & Sons, Ltd.
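
The idea of task splitting combined with garbage collection can be illustrated with a minimal sketch. This is not the Nimrod/G implementation: the function `run_chain`, the dictionary `store` standing in for a storage element, and the `flaky_stage` test function are all hypothetical names chosen for illustration. The sketch assumes each subtask depends only on the output of the subtask before it, so that earlier outputs can be deleted once the next subtask has completed.

```python
def run_chain(num_stages, stage_fn, store, initial):
    """Run dependent subtasks in order, resuming after a fault.

    `store` is a hypothetical stand-in for persistent storage: it maps a
    stage index to that stage's saved output.
    """
    # Resume from the newest stage output still in the store, if any.
    last_done = max(store, default=-1)
    state = store[last_done] if last_done >= 0 else initial
    for stage in range(last_done + 1, num_stages):
        state = stage_fn(stage, state)   # may raise if the subtask fails
        store[stage] = state             # persist this subtask's output
        store.pop(stage - 1, None)       # garbage-collect the previous output
    return state


# Usage: a subtask function that fails once at stage 2 (simulated fault).
calls = []
failed = {"once": False}

def flaky_stage(stage, state):
    calls.append(stage)
    if stage == 2 and not failed["once"]:
        failed["once"] = True
        raise RuntimeError("simulated node failure")
    return state + 1

store = {}
try:
    run_chain(4, flaky_stage, store, initial=0)
except RuntimeError:
    pass  # stages 0 and 1 are already complete; only stage 1's output is kept

result = run_chain(4, flaky_stage, store, initial=0)  # resumes at stage 2
```

In this sketch a failure costs only the work of the subtask that failed, and the store never holds more than two subtask outputs at a time, which is the kind of trade-off a storage-constrained sweep would care about.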
