How to bring together fault tolerance and data consistency to enable Grid data sharing

This paper addresses the challenge of transparent data sharing within computing Grids built as cluster federations. On such platforms, the availability of storage resources may change in a dynamic way, often due to hardware failures. We focus on the problem of handling the consistency of replicated data in the presence of failures. We propose a software architecture which decouples consistency management from fault tolerance management. We illustrate this architecture with a case study showing how to design a consistency protocol using fault‐tolerant building blocks. As a proof of concept, we describe a prototype implementation of this protocol within JUXMEM, a software experimental platform for Grid data sharing, and we report on a preliminary experimental evaluation of the proposed approach. Copyright © 2006 John Wiley & Sons, Ltd.

[1]  Pierre Sens,et al.  An Effective Logical Cache for a Clustered LRC-Based DSM System , 2004, Cluster Computing.

[2]  Idit Keidar,et al.  Group communication specifications: a comprehensive study , 2001, CSUR.

[3]  Sam Toueg,et al.  Unreliable failure detectors for reliable distributed systems , 1996, JACM.

[4]  Anoop Gupta,et al.  Memory consistency and event ordering in scalable shared-memory multiprocessors , 1990, ISCA '90.

[5]  Mathieu Jan,et al.  JuxMem: An Adaptive Supportive Platform for Data Sharing on the Grid , 2001, Scalable Comput. Pract. Exp..

[6]  Paul Hudak,et al.  Memory coherence in shared virtual memory systems , 1989, TOCS.

[7]  Luigi Rizzo,et al.  Dummynet: a simple approach to the evaluation of network protocols , 1997, CCRV.

[8]  Anne-Marie Kermarrec,et al.  A recoverable distributed shared memory integrating coherence and recoverability , 1995, Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers.

[9]  Gabriel Antoniu,et al.  Making a DSM consistency protocol hierarchy-aware: an efficient synchronization scheme , 2003, CCGrid 2003. 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid, 2003. Proceedings..

[10]  Liviu Iftode,et al.  Scalable Fault-Tolerant Distributed Shared Memory , 2000, ACM/IEEE SC 2000 Conference (SC'00).

[11]  Liviu Iftode,et al.  Scope Consistency: A Bridge between Release Consistency and Entry Consistency , 1996, SPAA '96.

[12]  André Schiper,et al.  A Step Towards a New Generation of Group Communication Systems , 2003, Middleware.

[13]  Fred B. Schneider,et al.  Implementing fault-tolerant services using the state machine approach: a tutorial , 1990, CSUR.

[14]  Sébastien Monnet,et al.  Large-Scale Deployment in P2P Experiments Using the JXTA Distributed Framework , 2004, Euro-Par.

[15]  Brian N. Bershad,et al.  The Midway distributed shared memory system , 1993, Digest of Papers. Compcon Spring.

[16]  Liviu Iftode,et al.  Performance evaluation of two home-based lazy release consistency protocols for shared virtual memory systems , 1996, OSDI '96.

[17]  Veljko M. Milutinovic,et al.  Distributed shared memory: concepts and systems , 1997, IEEE Parallel Distributed Technol. Syst. Appl..

[18]  Alessandro Bassi,et al.  The Internet Backplane Protocol: A Study in Resource Sharing , 2002, 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID'02).

[19]  Miron Livny,et al.  Stork: making data placement a first class citizen in the grid , 2004, 24th International Conference on Distributed Computing Systems, 2004. Proceedings..

[20]  Pierre Sens,et al.  Implementation and performance evaluation of an adaptable failure detector , 2002, Proceedings International Conference on Dependable Systems and Networks.

[21]  Ian T. Foster,et al.  Data management and transfer in high-performance computational grid environments , 2002, Parallel Comput..

[22]  Ian T. Foster,et al.  Globus: a Metacomputing Infrastructure Toolkit , 1997, Int. J. High Perform. Comput. Appl..

[23]  Miguel Castro,et al.  Practical byzantine fault tolerance and proactive recovery , 2002, TOCS.

[24]  Rachid Guerraoui,et al.  Software-Based Replication for Fault Tolerance , 1997, Computer.