An Optimized Policy for Automatic Failure Recovery in Microrebootable Distributed Systems

To address the challenge of generating recovery policies when failure detection is inaccurate, this paper presents a failure recovery model for microrebootable distributed systems based on discounted Partially Observable Markov Decision Processes (POMDPs). Recovery policies are generated by solving the POMDP model. Because an exact solution is computationally expensive, a value-function approximation, the fast informed bound method, is used to obtain near-optimal policies. Simulation-based experiments on a realistic network security situation prediction system show that the proposed model can be solved effectively and that the resulting policies outperform the other policies compared.
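As a rough illustration of the fast informed bound idea, the sketch below iterates the update as it is commonly stated in the POMDP literature, Q(s,a) = R(s,a) + γ Σ_o max_a' Σ_s' P(s'|s,a) P(o|s',a) Q(s',a'), on a hypothetical two-state recovery model. The states, actions, observations, probabilities, and costs are all assumptions chosen for illustration; they are not taken from the paper's model or experiments, and the function names are likewise hypothetical.

```python
def fast_informed_bound(T, O, R, gamma, iters=200):
    """Iterate the fast informed bound update on Q(s, a).

    T[a][s][s2]: probability of moving from state s to s2 under action a.
    O[a][s2][o]: probability of observing o after reaching s2 via action a.
    R[s][a]:     immediate reward (negative values represent costs).
    """
    n_s, n_a, n_o = len(R), len(R[0]), len(O[0][0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(iters):
        Q_new = [[0.0] * n_a for _ in range(n_s)]
        for s in range(n_s):
            for a in range(n_a):
                future = 0.0
                for o in range(n_o):
                    # Best follow-up action is chosen per observation.
                    future += max(
                        sum(T[a][s][s2] * O[a][s2][o] * Q[s2][a2]
                            for s2 in range(n_s))
                        for a2 in range(n_a)
                    )
                Q_new[s][a] = R[s][a] + gamma * future
        Q = Q_new
    return Q


def greedy_action(Q, belief):
    """Pick the action maximizing the belief-weighted Q value."""
    n_a = len(Q[0])
    values = [sum(b * Q[s][a] for s, b in enumerate(belief))
              for a in range(n_a)]
    return max(range(n_a), key=values.__getitem__)


# Hypothetical toy model (all numbers are illustrative assumptions).
# States: 0 = healthy, 1 = faulty. Actions: 0 = wait, 1 = microreboot.
# Observations: 0 = "ok", 1 = "alarm" (the detector is imperfect).
WAIT, REBOOT = 0, 1
T = [
    [[0.95, 0.05], [0.0, 1.0]],  # wait: faults appear and then persist
    [[1.0, 0.0], [0.9, 0.1]],    # microreboot: usually clears the fault
]
detector = [[0.9, 0.1], [0.2, 0.8]]  # rows: next state, cols: observation
O = [detector, detector]             # same detector under both actions
R = [[0.0, -1.0], [-5.0, -2.0]]      # R[s][a]: downtime and failure costs

Q = fast_informed_bound(T, O, R, gamma=0.9)
action_when_healthy = greedy_action(Q, [1.0, 0.0])
action_when_faulty = greedy_action(Q, [0.0, 1.0])
action_when_unsure = greedy_action(Q, [0.5, 0.5])
```

Choosing the action from the belief-weighted Q values is one simple way to turn the bound into a policy; other policy extraction schemes, such as a one-step lookahead, are possible.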
