Using Cliques of Nodes to Store Desktop Grid Checkpoints

Checkpoints that store intermediate results of a computation have a fundamental impact on the computing throughput of Desktop Grid systems such as BOINC. Currently, BOINC workers store their checkpoints locally. A major limitation of this approach is that whenever a worker abandons an unfinished computation, no other worker can resume it from the last stable checkpoint, so the task must be restarted from scratch once the original machine is no longer available. To overcome this limitation, we propose sharing checkpoints between nodes. To organize this mechanism, we arrange nodes into complete graphs (cliques), in which every node shares all the checkpoints it computes with the other members. Cliques function as survivable units: checkpoints and tasks are not lost as long as at least one node of the clique remains alive. To simplify the construction and maintenance of the cliques, we rely on BOINC's central supervisor. To evaluate our solution, we combine simulation with real data to answer the most fundamental question: what price must we pay for the increased throughput?
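The survivability property of a clique can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Node` and `Clique` classes, the integer `progress` value standing in for real checkpoint state, and the `loss_probability` helper (which assumes that nodes leave independently with the same probability) are all hypothetical names and assumptions introduced here for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical worker; 'progress' integers stand in for checkpoint data."""
    name: str
    alive: bool = True
    # task id -> most recent checkpoint this node holds
    checkpoints: dict = field(default_factory=dict)


class Clique:
    """A fully connected group of nodes that replicate every checkpoint."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def save_checkpoint(self, task_id, progress):
        # Complete graph: every live member receives a copy of the checkpoint.
        for node in self.nodes:
            if node.alive:
                node.checkpoints[task_id] = progress

    def latest_checkpoint(self, task_id):
        # Any surviving member can serve the checkpoint; None means it is lost.
        found = [
            node.checkpoints[task_id]
            for node in self.nodes
            if node.alive and task_id in node.checkpoints
        ]
        return max(found) if found else None


def loss_probability(p_leave, clique_size):
    # Assumption: independent departures, so a checkpoint is lost only
    # when all members of the clique have left.
    return p_leave ** clique_size


if __name__ == "__main__":
    clique = Clique([Node("a"), Node("b"), Node("c")])
    clique.save_checkpoint("task-1", 40)
    clique.nodes[0].alive = False  # the original worker leaves
    print(clique.latest_checkpoint("task-1"))
    print(loss_probability(0.5, 3))
```

Under the independence assumption, larger cliques lose checkpoints less often, but every checkpoint is copied to every member, so replication traffic and storage grow with clique size. That trade-off is the cost the closing question of the abstract refers to.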
