An Adaptive Checkpointing Scheme for Peer-to-Peer Based Volunteer Computing Work Flows

Volunteer computing, sometimes called public resource computing, is an emerging computational model that is well suited to work-pooled parallel processing. As more complex grid applications use work flows in their design and deployment, it is reasonable to consider the impact of deploying work flows over a volunteer computing infrastructure. In this setting, the I/O between work flows can significantly increase the I/O demand at the work pool server. A possible solution is a peer-to-peer based parallel computing architecture that off-loads this I/O demand to the workers, which can then take on some aspects of work flow coordination and I/O checking. However, achieving robustness in such a large-scale system is a major hurdle to the decentralized execution of work flows and of parallel processes in general. To increase robustness, we propose an adaptive checkpoint scheme, and show its merits, that efficiently checkpoints the status of the parallel processes according to estimates of relevant network and peer parameters. Based on our proposed mathematical checkpoint model, the scheme uses statistical data observed at runtime to make checkpoint decisions dynamically and in a completely decentralized manner. Simulation results support the proposed approach, showing a reduction in required runtime.
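
The abstract does not state the checkpoint model itself, so the following is only a minimal sketch of the general idea of adapting a checkpoint interval to a failure rate estimated at runtime. It assumes exponentially distributed peer lifetimes and substitutes Young's well-known first-order approximation, T = sqrt(2C / lambda), for the paper's own model. The function names and the sample values are hypothetical and chosen for illustration.

```python
import math


def estimate_failure_rate(observed_lifetimes):
    """Estimate the failure rate (failures per second) from observed lifetimes.

    Assumption: lifetimes are exponentially distributed, so the
    maximum-likelihood estimate is n / sum(lifetimes).
    """
    total = sum(observed_lifetimes)
    if not observed_lifetimes or total <= 0:
        raise ValueError("need at least one positive lifetime")
    return len(observed_lifetimes) / total


def checkpoint_interval(checkpoint_cost, failure_rate):
    """Young's approximation for the checkpoint interval, in seconds.

    This is a stand-in for the paper's model, not the model itself:
    T = sqrt(2 * C / lambda), with C the cost of one checkpoint.
    """
    if checkpoint_cost <= 0 or failure_rate <= 0:
        raise ValueError("cost and failure rate must be positive")
    return math.sqrt(2.0 * checkpoint_cost / failure_rate)


# Hypothetical local observations: three peers lived 30, 60 and 90 minutes.
rate = estimate_failure_rate([1800.0, 3600.0, 5400.0])

# With a 2-second checkpoint cost, the interval comes out at about 120 s.
interval = checkpoint_interval(2.0, rate)
```

Because each worker can compute the estimate from lifetimes it observes locally, a decision of this kind needs no central coordinator. As the estimated failure rate rises, the interval shrinks and checkpoints are taken more often.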
