Fail-stop Failure Recovery in Neighbor Replica Environment

Abstract

Failure recovery is a nontrivial property of current distributed systems. Autonomous failure recovery is the ability of a distributed system to take self-corrective action when an instance or a subset of the system becomes faulty. However, in today's large distributed systems, autonomous failure recovery is a complicated procedure that is often difficult to implement. To achieve a high level of reliability and availability in such environments, this paper presents an autonomous, self-configuring fail-stop failure recovery model. The model exploits the advantages of the distributed neighbor replica technique (NRT). The paper describes the algorithm together with the theoretical framework for autonomous failure recovery, and proposes a resource manager for optimal resource selection. When a resource fails, the resource manager autonomously selects a replacement from among the faulty resource's neighbors and reconfigures the system. This selection is based on specified reliability parameters or criteria. The paper also presents a prototype implementation of the model. The model is shown to be theoretically sound and able to perform autonomous recovery smoothly, quickly reconfiguring its services upon detection of a failure.
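The recovery step summarized above (detect a fail-stop failure, pick a replacement among the failed resource's neighbors by reliability, then reconfigure) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption for illustration only: the abstract does not define the NRT neighbor topology or the actual reliability criteria, so the `Resource` and `ResourceManager` classes, the `uptime_ratio` and `load` fields, and the weighted `score` are hypothetical and not the paper's method.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    name: str
    alive: bool = True
    # Hypothetical reliability parameters; the paper only states that
    # selection uses "reliability parameters or criteria".
    uptime_ratio: float = 1.0  # assumed fraction of time available, 0..1
    load: float = 0.0          # assumed current utilisation, 0..1
    neighbors: list = field(default_factory=list)  # names of neighbor replicas


class ResourceManager:
    """Hypothetical manager sketching neighbor-based fail-stop recovery."""

    def __init__(self, resources):
        self.resources = {r.name: r for r in resources}
        # Maps each logical service slot to the resource currently serving it.
        self.assignment = {r.name: r.name for r in resources}

    def score(self, r):
        # Assumed criterion: prefer high availability, penalise high load.
        return r.uptime_ratio - 0.5 * r.load

    def select_replacement(self, failed_name):
        # Only live neighbors of the failed resource are candidates.
        failed = self.resources[failed_name]
        candidates = [self.resources[n] for n in failed.neighbors
                      if self.resources[n].alive]
        return max(candidates, key=self.score) if candidates else None

    def on_failure(self, failed_name):
        # Fail-stop: the faulty resource halts and is marked as down.
        self.resources[failed_name].alive = False
        best = self.select_replacement(failed_name)
        if best is None:
            raise RuntimeError(f"no live neighbor for {failed_name}")
        # Reconfigure: every slot served by the failed resource moves to best.
        for slot, owner in self.assignment.items():
            if owner == failed_name:
                self.assignment[slot] = best.name
        return best.name


# Usage: four resources in a ring, A-B-C-D-A.
rm = ResourceManager([
    Resource("A", neighbors=["B", "D"]),
    Resource("B", uptime_ratio=0.99, load=0.7, neighbors=["A", "C"]),
    Resource("C", neighbors=["B", "D"]),
    Resource("D", uptime_ratio=0.95, load=0.1, neighbors=["A", "C"]),
])
print(rm.on_failure("A"))  # D scores 0.90 vs. B's 0.64, so D takes over A's slot
```

Restricting candidates to neighbors keeps the search local, which mirrors the abstract's point that recovery happens among the faulty resource's neighbors rather than across the whole system; any real deployment would substitute the paper's own reliability criteria for the illustrative score.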