Fault-Tree-Based Service Availability Model in Cloud Environments: A Failure Trace Archive Approach

In a cloud computing environment that offers live migration and elastic resource provisioning, and where critical service availability is a mandatory requirement, the challenge is how to use basic fault tree analysis to assess the health state of a node or service instance and to perform load balancing autonomously. We propose a model that abstracts events from run-time logs in order to assess whether the primary service instance or its replica is reliable or unreliable. Replication or live migration is then used to keep service availability at an acceptable level. The model is probabilistic and is validated using the LANL HPC data set from the Failure Trace Archive (FTA).
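As an illustration of the kind of probabilistic fault tree evaluation described above, the following is a minimal sketch and not the authors' implementation. It assumes independent basic events, a hypothetical tree in which an instance fails if any of three event classes occurs (OR gate), and a service that fails only if both the primary and the replica fail (AND gate). The event names, probabilities, and the 0.05 threshold are invented for the example.

```python
from math import prod

def and_gate(probs):
    """Failure probability when ALL independent inputs must fail."""
    return prod(probs)

def or_gate(probs):
    """Failure probability when ANY independent input failing is enough."""
    return 1.0 - prod(1.0 - p for p in probs)

def assess(event_probs, threshold=0.05):
    """Classify an instance from its basic-event failure probabilities.

    The threshold is a hypothetical acceptance level, not a value from the paper.
    """
    p_fail = or_gate(event_probs.values())
    state = "unreliable" if p_fail > threshold else "reliable"
    return p_fail, state

# Hypothetical per-event failure probabilities, e.g. estimated from run-time logs.
primary = {"hardware": 0.02, "software": 0.03, "network": 0.01}
replica = {"hardware": 0.01, "software": 0.01, "network": 0.01}

p_primary, state_primary = assess(primary)
p_replica, state_replica = assess(replica)

# The service is unavailable only if the primary AND the replica both fail.
p_service = and_gate([p_primary, p_replica])

print(f"primary: {p_primary:.6f} ({state_primary})")
print(f"replica: {p_replica:.6f} ({state_replica})")
print(f"service failure probability: {p_service:.6f}")
```

In this sketch, an instance classified as unreliable would be the trigger for replication or live migration; how the actual model maps log events to probabilities is not specified in the abstract.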
