Failover strategy for fault tolerance in cloud computing environment

Fault tolerance is an important issue in cloud computing platforms and applications. In the event of an unexpected system failure or malfunction, a robust fault‐tolerant design may allow the cloud to continue functioning correctly, possibly at a reduced level, instead of failing completely. To ensure high availability of critical cloud services, application execution, and hardware performance, various fault‐tolerant techniques exist for building autonomous cloud systems. In comparison with current approaches, this paper proposes a more robust and reliable architecture that uses an optimal checkpointing strategy to ensure high system availability and a reduced task service finish time. Using pass rates and virtualized mechanisms, the proposed smart failover strategy (SFS) scheme combines a cloud fault manager, a cloud controller, a cloud load balancer, and a selection mechanism, providing fault tolerance via redundancy, optimized selection, and checkpointing. In our approach, the cloud fault manager repairs faults that occur before the task deadline is reached and blocks unrecoverable faulty nodes as well as their virtual nodes. The scheme can also remove temporary software faults from recoverable faulty nodes, thereby making them available for future requests. We argue that the proposed SFS algorithm makes the system highly fault tolerant by considering forward and backward recovery using diverse software tools. Compared with existing approaches, preliminary experiments with the SFS algorithm indicate an increase in pass rates and a consequent decrease in failure rates, showing good overall performance in task allocation. We present these results using experimental validation tools and comparisons with other techniques, laying a foundation for a fully fault‐tolerant infrastructure as a service cloud environment. Copyright © 2017 John Wiley & Sons, Ltd.
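
To make the idea of pass-rate-based selection with node blocking more concrete, the following is a minimal sketch and not the SFS algorithm itself. It assumes that a node's pass rate is the fraction of its tasks that finished before their deadline; the names `Node`, `select_node`, and `record_result` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical bookkeeping record for one (virtual) node."""
    name: str
    passed: int = 0        # tasks that finished before their deadline
    failed: int = 0        # tasks that missed the deadline or crashed
    blocked: bool = False  # set once a fault is judged unrecoverable

    @property
    def pass_rate(self) -> float:
        # Assumed definition: passed / (passed + failed); 0.0 with no history.
        total = self.passed + self.failed
        return self.passed / total if total else 0.0


def select_node(nodes):
    """Return the unblocked node with the highest pass rate, or None."""
    candidates = [n for n in nodes if not n.blocked]
    return max(candidates, key=lambda n: n.pass_rate, default=None)


def record_result(node, success, recoverable=True):
    """Update the node's counters; block it if the fault is unrecoverable."""
    if success:
        node.passed += 1
    else:
        node.failed += 1
        if not recoverable:
            node.blocked = True
```

In this sketch a node with a recoverable fault stays eligible for later requests, only with a lower pass rate, while a node with an unrecoverable fault is excluded from selection entirely. Checkpointing and the repair step are deliberately left out.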
