Chapter 7 – Fault Tolerance and Resilience in Cloud Computing Environments

The growing demand for flexible, scalable, cost-effective, and device-independent ways to acquire and release computing resources on demand, together with the ease of hosting applications without the burden of installation and maintenance, has led to wide adoption of the cloud computing paradigm. While the benefits are immense, this paradigm remains vulnerable to a large number of system failures; as a consequence, users have become increasingly concerned about the reliability and availability of cloud computing services. Fault tolerance and resilience are an effective means of addressing these concerns. In this chapter, we characterize the recurrent failures in a typical cloud computing environment, analyze the effects of those failures on users’ applications, and survey the fault tolerance solutions corresponding to each class of failure. We also discuss the prospect of offering fault tolerance as a service to users’ applications as a further means of addressing these concerns.
