Analyzing, modeling and evaluating dynamic adaptive fault tolerance strategies in cloud computing environments

Failures are normal rather than exceptional in cloud computing environments, and achieving high fault tolerance is one of the major obstacles to opening up a new era of high-serviceability cloud computing, because fault tolerance plays a key role in ensuring cloud serviceability. Fault-tolerant service is an essential part of Service Level Objectives (SLOs) in clouds. To achieve a high level of cloud serviceability and to meet demanding SLOs, a foolproof fault tolerance strategy is needed. In this paper, the definitions of fault, error, and failure in a cloud are given, and the principles for high fault tolerance objectives are systematically analyzed with reference to the fault tolerance theories suitable for large-scale distributed computing environments. Based on the principles and semantics of cloud fault tolerance, a dynamic adaptive fault tolerance strategy, DAFT, is put forward. It includes: (i) analyzing the mathematical relationship between different failure rates and two fault tolerance strategies, namely the checkpointing fault tolerance strategy and the data replication fault tolerance strategy; (ii) building a dynamic adaptive checkpointing fault tolerance model and a dynamic adaptive replication fault tolerance model, and combining the two models to maximize serviceability and meet the SLOs; and (iii) evaluating the dynamic adaptive fault tolerance strategy under various conditions in large-scale cloud data centers, considering different system-centric parameters such as fault tolerance degree, fault tolerance overhead, and response time. Theoretical as well as experimental results demonstrate that DAFT has high potential, as it provides efficient fault tolerance enhancements, significant cloud serviceability improvement, and high SLO satisfaction. It efficiently and effectively achieves a trade-off among fault tolerance objectives in cloud computing environments.
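
The abstract does not state the formulas behind item (i), so the following is only a minimal sketch of how failure rate can drive the two strategies, using textbook results rather than the DAFT models themselves. It assumes Young's first-order approximation for the checkpoint interval and independent replica failures for availability; the function names and parameters are hypothetical and do not come from the paper.

```python
import math


def young_checkpoint_interval(checkpoint_cost_s: float, failure_rate_per_s: float) -> float:
    """Young's first-order approximation of the checkpoint interval.

    Assumption (not from the paper): failures arrive at a constant rate
    lambda, and the interval that balances checkpoint overhead against
    lost work is sqrt(2 * C / lambda), where C is the cost of one checkpoint.
    """
    if checkpoint_cost_s <= 0 or failure_rate_per_s <= 0:
        raise ValueError("checkpoint cost and failure rate must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s / failure_rate_per_s)


def replicas_needed(replica_availability: float, target_availability: float) -> int:
    """Smallest replica count n with 1 - (1 - a)^n >= target.

    Assumption (not from the paper): replicas fail independently, so the
    data is unavailable only when all n replicas are down at once.
    """
    if not 0.0 < replica_availability < 1.0:
        raise ValueError("replica availability must be strictly between 0 and 1")
    if not 0.0 < target_availability < 1.0:
        raise ValueError("target availability must be strictly between 0 and 1")
    n = 1
    while 1.0 - (1.0 - replica_availability) ** n < target_availability:
        n += 1
    return n


if __name__ == "__main__":
    # A higher failure rate shortens the checkpoint interval...
    for mtbf_s in (100_000.0, 10_000.0, 1_000.0):
        interval = young_checkpoint_interval(50.0, 1.0 / mtbf_s)
        print(f"MTBF {mtbf_s:>9.0f} s -> checkpoint every {interval:.0f} s")
    # ...and a lower per-replica availability raises the replica count.
    for a in (0.99, 0.9, 0.7):
        print(f"availability {a} -> {replicas_needed(a, 0.995)} replicas")
```

In this sketch both quantities react to the failure rate in opposite cost directions: more frequent failures mean more frequent checkpoints and more replicas, which is the kind of overhead versus fault tolerance degree trade-off that an adaptive strategy has to balance.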
