Reliable Management of Virtualized Resources Using Fault Trees

New trends in distributed computing have changed how computing is done in cloud infrastructures and high-performance computing. Resource virtualization technologies enable elastic resource provisioning and management through easy replication of virtual nodes and through virtual machine migration. Providing high availability and reliability in such distributed environments, where resources are managed and served as virtual machines, requires specific load-balancing and fault-management strategies. Based on fault tree analysis concepts, we propose a distributed and autonomous approach to managing faults using fault agents that can assess and predict, for each virtualized node, its current or future fault state. Accordingly, each node can decide whether to accept future jobs, delegate jobs to its own replicated instances, or start a live migration process as a second strategy for assuring availability and continuity of the service.
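
To make the idea concrete, the following is a minimal sketch of how a per-node fault agent might evaluate a fault tree and map the result to one of the three actions named above. It is an illustration under stated assumptions, not the method of the paper: the class and function names (`BasicEvent`, `Gate`, `failure_probability`, `decide`), the example tree, the threshold value, and the assumption that basic events are statistically independent are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class BasicEvent:
    """Leaf of the fault tree: a component fault with an estimated probability."""
    name: str
    probability: float


@dataclass
class Gate:
    """Inner node of the fault tree; kind is "AND" or "OR"."""
    kind: str
    children: List[Union["Gate", BasicEvent]]


def failure_probability(node: Union[Gate, BasicEvent]) -> float:
    """Probability of the event at this node, assuming independent basic events."""
    if isinstance(node, BasicEvent):
        return node.probability
    probs = [failure_probability(child) for child in node.children]
    if node.kind == "AND":
        # All children must fail: multiply the probabilities.
        result = 1.0
        for p in probs:
            result *= p
        return result
    if node.kind == "OR":
        # At least one child fails: complement of "none of them fail".
        none_fail = 1.0
        for p in probs:
            none_fail *= 1.0 - p
        return 1.0 - none_fail
    raise ValueError(f"unknown gate kind: {node.kind!r}")


def decide(p_fail: float, has_replica: bool, accept_below: float = 0.05) -> str:
    """Hypothetical policy: accept jobs, delegate to a replica, or live-migrate."""
    if p_fail < accept_below:
        return "accept"
    if has_replica:
        return "delegate"
    return "migrate"


# Example tree: the node fails if its disk fails OR both network links fail.
tree = Gate("OR", [
    BasicEvent("disk", 0.1),
    Gate("AND", [BasicEvent("nic_a", 0.5), BasicEvent("nic_b", 0.5)]),
])

p = failure_probability(tree)
action = decide(p, has_replica=False)
```

In this sketch the top-event probability is compared against a single illustrative threshold; a real agent would also need a way to estimate and update the basic-event probabilities over time, which is outside what this example shows.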
