Building a High Serviceability Model by Checkpointing and Replication Strategy in Cloud Computing Environments

Fault tolerance plays a key role in ensuring cloud serviceability, and the difficulty of achieving high fault tolerance is one of the major obstacles to a new era of high serviceability cloud computing. In most current clouds, the two most common fault tolerance strategies are checkpointing, the process of saving application states, and replication, the process of replicating hot data, usually to stable storage. However, when, where, and how often to insert checkpoints or to replicate hot data remain open challenges that are largely ignored in clouds. This paper gives definitions of fault, error, and failure in a cloud and puts forward HSCR, a high serviceability model based on a checkpointing and replication strategy. The work includes: (1) analyzing the mathematical relationship between different failure rates and the two fault tolerance strategies, namely checkpointing and data replication; and (2) building a high serviceability checkpointing fault tolerance model and a high serviceability replication fault tolerance model, and combining the two to maximize serviceability and meet the service-level objectives (SLOs). Experimental results demonstrate that HSCR has high potential: it provides efficient fault tolerance enhancements, significantly improves cloud serviceability, and satisfies the SLOs well.
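To make the "how often" question concrete, the sketch below uses two textbook approximations that are not taken from this paper and are not the HSCR model: Young's first-order formula for the checkpoint interval, T = sqrt(2 * C * MTBF), and the independent-failure estimate 1 - (1 - a)^r for the availability of r replicas. The function names, parameters, and sample numbers are hypothetical and chosen only for illustration.

```python
import math


def young_checkpoint_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the checkpoint interval.

    Assumption (not from the paper): failures arrive with a constant rate
    1/mtbf_s and each checkpoint costs checkpoint_cost_s seconds.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def replicas_needed(node_availability: float, target_availability: float) -> int:
    """Smallest replica count r such that 1 - (1 - a)^r >= target.

    Assumption (not from the paper): replicas fail independently and each
    is available with probability node_availability.
    """
    if not (0.0 < node_availability < 1.0 and 0.0 < target_availability < 1.0):
        raise ValueError("availabilities must lie strictly between 0 and 1")
    replicas = 1
    while 1.0 - (1.0 - node_availability) ** replicas < target_availability:
        replicas += 1
    return replicas


if __name__ == "__main__":
    # Hypothetical numbers: 60 s per checkpoint, one failure every 3000 s.
    print(young_checkpoint_interval(60.0, 3000.0))
    # Hypothetical numbers: 90% available nodes, 99.5% availability target.
    print(replicas_needed(0.9, 0.995))
```

Both functions show the general trade-off the abstract refers to: a higher failure rate (lower MTBF or lower node availability) calls for more frequent checkpoints or more replicas, each at added cost.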