Fault Tolerance and Recovery for Grid Application Reliability using Check Pointing Mechanism

Check pointing with rollback recovery is a well-known method for achieving fault tolerance in grid computing systems. If a resource or process becomes faulty at run time, the check pointing mechanism detects it through the Task Dependency Graph (TDG), and the worst-case execution time and deadline of each task are used to decide schedulability. The common approach is to use a rollback-dependency graph or a checkpoint graph. Concurrent tasks can be scheduled with the proposed Concurrent Task Scheduling Algorithm (CTSA), which recovers from faulty states using replication or rollback techniques. Earlier fault detection methods do not scale with the diversity of user applications, and because the frequency of faults varies dynamically, faults are hard to detect and recover from. Check pointing and replication mechanisms have been used in high-performance grid computing, where synchronization between communicating processes is needed to make check pointing more efficient. Performance under faulty conditions can be improved with or without data and process replication. The experimental results show that CTSA can lead to significant performance gains in a variety of scenarios.
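The schedulability decision described above can be illustrated with a minimal sketch. This is not the CTSA algorithm itself, which the abstract does not specify. It assumes a TDG given as a mapping from each task to its predecessors, one worst-case execution time (WCET) and one absolute deadline per task, and enough resources that a task starts as soon as all its predecessors finish. The function names `finish_times` and `missed_deadlines` are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def finish_times(wcet, deps):
    """Worst-case finish time of every task in the dependency graph.

    wcet: task -> worst-case execution time
    deps: task -> iterable of predecessor tasks (tasks without
          predecessors may be omitted)
    Assumption: a task starts as soon as all its predecessors finish.
    """
    graph = {task: tuple(deps.get(task, ())) for task in wcet}
    finish = {}
    # static_order() yields every predecessor before its successors.
    for task in TopologicalSorter(graph).static_order():
        start = max((finish[p] for p in graph[task]), default=0)
        finish[task] = start + wcet[task]
    return finish


def missed_deadlines(wcet, deps, deadline):
    """Tasks whose worst-case finish time exceeds their deadline."""
    finish = finish_times(wcet, deps)
    return sorted(t for t in finish if finish[t] > deadline[t])
```

Under these assumptions, a task set is schedulable when `missed_deadlines` returns an empty list; a non-empty result names the tasks for which recovery by rollback or replication would have to be considered.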
