A Study on Fault Tolerance Mechanisms in Cloud Computing

Cloud computing is widely popular because of its elasticity, economics, and reliability, among other benefits. It offers scalable services without any initial investment in servers, storage, or networks. Fault Tolerance (FT) is the ability of a system to continue performing its function despite unexpected hardware or software failures. Fault Tolerance in Cloud Computing (FTCC) is an important area of research due to its complexity. However, there is a lack of studies in this field. Moreover, recent failures and availability issues at popular cloud providers demonstrate the need for more effective solutions. In this paper, we present a study of FTCC mechanisms and analyze their strengths and weaknesses. Based on the study, we present a comparison of the main fault tolerance techniques that considers cost, overhead, failure types, performance, and the tools used. We also study and compare the models that enhance the performance of checkpoint-based and replication-based techniques.
