Performance evaluation of fault tolerance techniques in grid computing system

Fault tolerance is the ability of a system to perform its function correctly even in the presence of faults. Fault tolerance techniques (FTTs) are therefore critical for the efficient utilization of expensive resources in high-performance grid computing systems, and they are an important component of grid workflow management systems. This paper presents a performance evaluation of the most commonly used FTTs in grid computing systems. We consider several system-centric parameters, namely throughput, turnaround time, waiting time, and network delay, to evaluate these FTTs. For a comprehensive evaluation, we set up various conditions in which we vary the average percentage of faults in the system, along with the workload, to observe the behavior of the FTTs under these conditions. The empirical evaluation shows that workflow-level alternative task techniques outperform task-level checkpointing techniques. This comparative study will help grid computing researchers understand the behavior and performance of different FTTs in detail.
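As an illustration of the system-centric parameters named above, the following minimal sketch computes turnaround time, waiting time, and throughput from per-job timestamps. It uses the conventional textbook definitions, which are an assumption here, since the abstract does not state the exact formulas used in the study. The `Job` record and all function names are hypothetical, and network delay is omitted because it depends on the simulated topology.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Job:
    """Hypothetical job record; all times share one arbitrary time unit."""
    submitted: float  # time the job entered the grid queue
    started: float    # time the job began executing on a resource
    finished: float   # time the job completed (including any recovery)


def turnaround_time(job: Job) -> float:
    # Assumed definition: total time from submission to completion.
    return job.finished - job.submitted


def waiting_time(job: Job) -> float:
    # Assumed definition: time spent queued before execution starts.
    return job.started - job.submitted


def throughput(jobs: list[Job]) -> float:
    # Assumed definition: completed jobs per unit of observed time.
    span = max(j.finished for j in jobs) - min(j.submitted for j in jobs)
    return len(jobs) / span


if __name__ == "__main__":
    jobs = [Job(0.0, 1.0, 5.0), Job(0.0, 2.0, 10.0)]
    print("mean turnaround:", mean(turnaround_time(j) for j in jobs))
    print("mean waiting:", mean(waiting_time(j) for j in jobs))
    print("throughput:", throughput(jobs))
```

Under these definitions, a fault that forces a job to restart or roll back to a checkpoint shows up as a later `finished` value, which raises turnaround time and lowers throughput for the same workload.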
