Paper Information - Fail-stop Failure Recovery in Neighbor Replica Environment

Fail-stop Failure Recovery in Neighbor Replica Environment

Abstract: Failure recovery is a nontrivial property for current distributed systems. Autonomous failure recovery in a distributed system is the ability of the system to execute self-corrective action when an instance or a subset of the system becomes faulty. However, autonomous failure recovery in current large distributed systems is a very complicated procedure and is often difficult to implement. In order to achieve a high level of reliability and availability in current distributed environments, this paper presents an autonomous, self-configured fail-stop failure recovery model. The model utilizes the advantages of the distributed neighbor replica technique (NRT). The algorithm and the theoretical framework for autonomous failure recovery are illustrated. The paper proposes a resource manager for optimal resource selection. In the event of a resource failure, the resource manager autonomously decides on a resource among the faulty resource's neighbors and auto-reconfigures the system. This selection is based on certain reliability parameters or criteria. The paper also illustrates a prototype implementation of the model. It also demonstrates that the model is theoretically sound, with the ability to perform autonomous recovery smoothly by quickly reconfiguring its services upon detection of a failure.
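The recovery step described in the abstract can be pictured with a minimal Python sketch. It is not the authors' algorithm: the `Resource` and `ResourceManager` classes, the single `reliability` score, and the `routes` table are all hypothetical names introduced here for illustration, and the abstract does not say which reliability parameters the paper actually uses. The sketch assumes that a failed resource simply stops (fail-stop), that its failure has already been detected, and that each neighbor holds a replica and can therefore take over.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    name: str
    reliability: float          # hypothetical score in [0, 1]; higher is better
    alive: bool = True
    neighbors: list = field(default_factory=list)  # names of replica-holding neighbors


class ResourceManager:
    """Hypothetical manager that reroutes services away from a failed resource."""

    def __init__(self, resources):
        self.resources = {r.name: r for r in resources}
        # Initially every resource serves its own requests.
        self.routes = {r.name: r.name for r in resources}

    def on_failure(self, failed_name):
        """Mark a resource as failed and pick its most reliable live neighbor."""
        failed = self.resources[failed_name]
        failed.alive = False
        candidates = [
            self.resources[n] for n in failed.neighbors if self.resources[n].alive
        ]
        if not candidates:
            return None  # no live neighbor: recovery is not possible in this sketch
        chosen = max(candidates, key=lambda r: r.reliability)
        # Reconfigure: everything the failed resource served now goes to the neighbor.
        for service, target in self.routes.items():
            if target == failed_name:
                self.routes[service] = chosen.name
        return chosen.name


# Usage: three resources in a line, where "b" has "a" and "c" as neighbors.
manager = ResourceManager([
    Resource("a", 0.9, neighbors=["b"]),
    Resource("b", 0.7, neighbors=["a", "c"]),
    Resource("c", 0.8, neighbors=["b"]),
])
replacement = manager.on_failure("b")
```

Selecting with `max` over a single score keeps the example short; a real selection based on several reliability criteria would need a weighting or ranking rule that the abstract does not specify.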

Mustafa Mat Deris | Ahmad Shukri Mohd Noor

[1] Bettina Schnor, et al. Migol: A fault-tolerant service framework for MPI applications in the grid, 2008, Future Gener. Comput. Syst.

[2] Mustafa Mat Deris, et al. Neighbor Replica Distribution Technique for Cluster Server Systems, 2004.

[3] S. Siva Sathya, et al. Survey of fault tolerant techniques for grid, 2010, Comput. Sci. Rev.

[4] Carl E. Landwehr, et al. Basic concepts and taxonomy of dependable and secure computing, 2004, IEEE Transactions on Dependable and Secure Computing.

[5] Zhanhuai Li, et al. A Fast Disaster Recovery Mechanism for Volume Replication Systems, 2007, HPCC.

[6] Kalim Qureshi, et al. Performance evaluation of fault tolerance techniques in grid computing system, 2010, Comput. Electr. Eng.