FT‐Grid: a system for achieving fault tolerance in grids

The FT‐Grid system introduces a fault‐tolerance framework that allows faults occurring in service‐oriented systems to be tolerated, thus increasing the dependability of such systems. This paper presents the design, development and evaluation of FT‐Grid. We provide empirical evidence of the dependability benefits offered by FT‐Grid through an experimental dependability analysis based on fault‐injection testing with the WS‐FIT tool. We then illustrate a potential problem with voting‐based fault‐tolerance schemes in the service‐oriented paradigm: individual channels within a fault‐tolerant system, which are supposed to be independent of each other, may in fact invoke common services as part of their workflow, thus increasing the potential for common‐mode failure of those channels. We propose a solution to this issue that uses provenance to provide FT‐Grid with topological awareness. We implement a large experimental system and, using the Provenance Recording for Services system developed as part of the PASOA project at the University of Southampton, perform a large number of experiments showing that a topologically aware FT‐Grid system is much more dependable than any other configuration tested, while imposing a negligible timing overhead. Copyright © 2007 John Wiley & Sons, Ltd.
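The common‐mode failure problem described above can be illustrated with a minimal sketch. It is not FT‐Grid's implementation: the names `ChannelResult`, `independent_channels` and `topology_aware_vote` are hypothetical, the set of services each channel invoked is assumed to be available from provenance records, and the policy of greedily keeping only channels with non‐overlapping service sets is just one possible way to use that topological information.

```python
from collections import Counter
from typing import Hashable, NamedTuple


class ChannelResult(NamedTuple):
    channel: str         # hypothetical channel identifier
    value: Hashable      # result returned by the channel
    services: frozenset  # services the channel invoked (assumed to come from provenance)


def majority_vote(values):
    """Return the value held by a strict majority, or None if there is none."""
    values = list(values)
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    return value if count * 2 > len(values) else None


def independent_channels(results):
    """Greedily keep channels that share no service with any channel already kept."""
    kept, used = [], set()
    for r in results:
        if used.isdisjoint(r.services):
            kept.append(r)
            used |= r.services
    return kept


def topology_aware_vote(results):
    """Majority vote restricted to channels with pairwise-disjoint service sets."""
    return majority_vote(r.value for r in independent_channels(results))


# Three channels share a faulty service "S" and agree on a wrong value (0);
# two genuinely independent channels return the correct value (42).
results = [
    ChannelResult("c1", 0, frozenset({"S", "A"})),
    ChannelResult("c2", 0, frozenset({"S", "B"})),
    ChannelResult("c3", 0, frozenset({"S", "C"})),
    ChannelResult("c4", 42, frozenset({"D"})),
    ChannelResult("c5", 42, frozenset({"E"})),
]

naive = majority_vote(r.value for r in results)  # the shared fault wins the vote
aware = topology_aware_vote(results)             # c2 and c3 are excluded
```

In this contrived example the naive vote returns the wrong value because the three channels that depend on the same service outvote the two independent ones, whereas the topology‐aware vote counts only one channel per shared dependency.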
