A Decentralized Fault Tolerant Model for Grid Computing

A current trend in high-performance computing is the use of large-scale computing grids. These platforms consist of geographically distributed cluster federations gathering thousands of nodes. At this scale, node and network failures are no longer exceptions but part of normal system behavior. Grid applications must therefore tolerate failures, and their evaluation should take their reaction to failures into account. Failures in distributed computing systems can be divided into three categories: node crashes, network failures, and process faults. Fault tolerance is a significant and complex issue in grid computing systems, and various techniques have been investigated to detect and tolerate faults in distributed computing systems. In this paper, we propose a decentralized fault tolerance model based on dynamic colored graphs. Using this model, we show through experiments the benefits of colored graphs for managing failures in grids.
