Fault Tolerance in Distributed Systems: A Survey

Distributed systems can be homogeneous (clusters) or heterogeneous (Grid, Cloud and P2P systems). Several problems arise in these systems, including quality of service (QoS), resource selection, load balancing and fault tolerance. Fault tolerance is a central concern in the design of distributed systems. When a hardware or software defect occurs in the system, it can cause a failure; we call such a defect a fault. To allow the system to keep functioning even in the presence of these faults, designers must apply fault tolerance techniques, whose goal is to detect and correct the resulting errors. In this paper, we first give an overview of the basic concepts of distributed systems and their failure types. We then present in detail the fault tolerance techniques used to identify and correct faults in different kinds of systems: cluster, Grid, Cloud and P2P systems.
