Model defined fault tolerance in cloud

Fault tolerance (FT) is one of the most important ways to achieve high availability (HA). In the cloud, however, diverse user requirements, heterogeneous cloud providers, complex FT implementations, and error-prone configuration make FT a real challenge. To cope with this, we propose a model-defined FT approach that automatically deploys FT mechanisms according to a high-level model. The FT model also allows existing FT mechanisms to be optimized through reuse. We implemented a prototype of our approach and evaluated it on a popular IaaS cloud, CloudStack.
