RPC-V: Toward Fault-Tolerant RPC for Internet Connected Desktop Grids with Volatile Nodes

RPC is one of the programming models envisioned for the Grid. In Internet-connected large-scale Grids such as Desktop Grids, node and network failures are not rare events. This paper makes several contributions that examine the feasibility and limits of fault-tolerant RPC on these platforms. First, we characterize these Grids by their fundamental features and show that their application scope should be restricted to stateless services. Second, we present a new fault-tolerant RPC protocol that combines a three-tier architecture, passive replication, and message logging. We describe RPC-V, an implementation of the proposed protocol within the XtremWeb Desktop Grid middleware. Third, we evaluate the performance of RPC-V and the impact of faults on execution time, using a real-life application on a Desktop Grid testbed assembling nodes in France and the USA. We demonstrate that RPC-V allows applications to continue their execution while key system components fail.
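
To make the combination of a middle tier, replicated coordinators, and message logging more concrete, here is a minimal sketch in Python. It is not the RPC-V protocol or the XtremWeb API: the names `Coordinator`, `Client`, `CoordinatorDown`, and `square` are hypothetical, failures are simulated with a flag, and replication is reduced to a simple failover in which the backup is contacted only when the primary is unreachable. The sketch assumes the service is stateless, so re-executing a logged request on another coordinator is safe.

```python
import itertools


class CoordinatorDown(Exception):
    """Raised when a (simulated) coordinator is unreachable."""


class Coordinator:
    """Hypothetical middle tier: runs a stateless service, caches results by request id."""

    def __init__(self, name, service):
        self.name = name
        self.service = service
        self.alive = True      # simulated failure switch
        self.results = {}      # request id -> result (duplicate suppression)

    def submit(self, request_id, args):
        if not self.alive:
            raise CoordinatorDown(self.name)
        # A resent request with a known id is answered from the cache.
        if request_id not in self.results:
            self.results[request_id] = self.service(*args)
        return self.results[request_id]


class Client:
    """Hypothetical client tier: logs each request before sending it."""

    def __init__(self, coordinators):
        self.coordinators = coordinators   # ordered: primary first, then backups
        self.log = {}                      # request id -> args, kept until answered
        self._ids = itertools.count(1)

    def call(self, *args):
        request_id = next(self._ids)
        self.log[request_id] = args        # message logging happens before the send
        for coordinator in self.coordinators:
            try:
                result = coordinator.submit(request_id, self.log[request_id])
            except CoordinatorDown:
                continue                   # fail over to the next replica
            del self.log[request_id]       # answered: the log entry can be dropped
            return result
        raise RuntimeError("no coordinator reachable; request stays in the log")


def square(x):
    """Stateless service used for the demonstration."""
    return x * x


primary = Coordinator("primary", square)
backup = Coordinator("backup", square)
client = Client([primary, backup])

primary.alive = False                      # simulate a coordinator failure
result = client.call(7)                    # served by the backup from the logged request
```

Because the request is logged on the client before it is sent, the loss of the primary coordinator does not lose the call: the client replays the logged arguments to the backup, and the request id lets a coordinator recognize a duplicate.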
