Fault-tolerance, malleability and migration for divide-and-conquer applications on the grid

Grid applications have to cope with dynamically changing computing resources, as machines may crash or be claimed by other, higher-priority applications. In this paper, we propose a mechanism that enables fault-tolerance, malleability (i.e. the ability to cope with a dynamically changing number of processors) and migration for divide-and-conquer applications on the grid. The novelty of our approach is restructuring the computation tree, which eliminates redundant computation and salvages partial results computed by the processors leaving the computation. This enables the applications to adapt to dynamically changing numbers of processors and to migrate the computation without loss of work. Our mechanism is easy to implement and deploy in a grid environment. The overhead it incurs is close to zero. We have implemented our mechanism in the Satin system. We have evaluated the performance of our system on the DAS-2 wide-area system and on the testbed of the European GridLab project.
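The idea of salvaging partial results can be illustrated with a small, single-process sketch. It is not the Satin implementation: the job identifier (an index range), the `salvaged` table, and the function `dc_sum` are all assumptions made for illustration. The sketch only shows how a divide-and-conquer computation might skip a subtree whose result was already computed by a processor that has since left.

```python
from typing import Dict, List, Tuple

# Hypothetical job identifier: a half-open index range [lo, hi).
Job = Tuple[int, int]


def dc_sum(data: List[int], lo: int, hi: int,
           salvaged: Dict[Job, int], executed: List[Job]) -> int:
    """Divide-and-conquer sum that reuses salvaged sub-results."""
    job = (lo, hi)
    # A result salvaged from a departed processor replaces the whole subtree.
    if job in salvaged:
        return salvaged[job]
    executed.append(job)  # record that this job had to be (re)computed
    if hi - lo == 1:
        return data[lo]
    mid = (lo + hi) // 2
    return (dc_sum(data, lo, mid, salvaged, executed)
            + dc_sum(data, mid, hi, salvaged, executed))


data = list(range(8))

# Assume a leaving processor had already finished the right half of the tree.
salvaged = {(4, 8): sum(data[4:8])}

executed: List[Job] = []
total = dc_sum(data, 0, 8, salvaged, executed)

# Baseline without any salvaged results, for comparison.
baseline: List[Job] = []
dc_sum(data, 0, 8, {}, baseline)

print(total, len(executed), len(baseline))
```

In this toy run the salvaged entry removes the entire right subtree from the recomputation, so fewer jobs are executed than in the baseline. A real grid runtime would additionally have to detect departures and distribute the salvaged results among the remaining processors, which this sketch does not attempt.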
