FT-Grid: a system for achieving fault tolerance in grids

FT-Grid is a fault-tolerance framework that allows faults occurring in service-oriented systems to be tolerated, thereby increasing the dependability of such systems. This paper presents the design, development and evaluation of FT-Grid. We provide empirical evidence of its dependability benefits through an experimental analysis based on fault-injection testing with the WS-FIT tool. We then illustrate a potential problem with voting-based fault-tolerance schemes in the service-oriented paradigm: individual channels within a fault-tolerant system, which are supposed to be independent of each other, may in fact invoke common services as part of their workflow, increasing the potential for common-mode failure of those channels. We propose addressing this issue by using provenance to give FT-Grid topological awareness. We implement a large experimental system and, using the Provenance Recording for Services system developed as part of the PASOA project at the University of Southampton, perform a large number of experiments. These show that a topologically aware FT-Grid system is considerably more dependable than any other configuration tested, while imposing a negligible timing overhead. Copyright © 2007 John Wiley & Sons, Ltd.
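The common-mode problem described above can be illustrated with a minimal sketch. It is not FT-Grid's actual algorithm: the function names, the greedy filtering rule, and the representation of provenance as a mapping from channel to the set of services it invoked are all assumptions made for illustration. The sketch compares a plain majority vote with one that first discards channels whose recorded workflows overlap.

```python
from collections import Counter


def majority_vote(results):
    """Return the value reported by a strict majority of channels, else None."""
    if not results:
        return None
    value, count = Counter(results.values()).most_common(1)[0]
    return value if count > len(results) / 2 else None


def independent_channels(provenance):
    """Greedily keep channels whose invoked services do not overlap
    with the services of any channel already kept (hypothetical rule)."""
    kept, seen = [], set()
    for channel in sorted(provenance):
        services = provenance[channel]
        if seen.isdisjoint(services):
            kept.append(channel)
            seen |= services
    return kept


def topology_aware_vote(results, provenance):
    """Vote only over channels that share no service with one another."""
    kept = independent_channels(provenance)
    return majority_vote({c: results[c] for c in kept})


# Hypothetical scenario: channels A, B and C all depend on a faulty
# shared service "s1", so they agree on the same wrong answer (41).
results = {"A": 41, "B": 41, "C": 41, "D": 42, "E": 42}
provenance = {
    "A": {"s1", "a"},
    "B": {"s1", "b"},
    "C": {"s1", "c"},
    "D": {"s2"},
    "E": {"s3"},
}

naive = majority_vote(results)
aware = topology_aware_vote(results, provenance)
```

In this made-up scenario the plain vote is swayed by the three channels that share `s1`, whereas the filtered vote counts that shared dependency only once. A real system would need a more careful policy than greedy filtering, for example weighting channels rather than discarding them.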
