High-Availability Computing Platform with Sensor Fault Resilience

Modern computing platforms typically rely on multiple sensors to report system information. To achieve high availability (HA), a platform can use these sensors to efficiently detect system faults that make a cloud service unavailable. However, a sensor itself may fail, which disables the HA protection that depends on it. In that case, human intervention is needed, either to change the original fault model or to repair the sensor. This study therefore proposes an HA mechanism that continues to provide HA to a cloud system by dynamically reconstructing the fault model. We implemented the proposed mechanism on a four-layer OpenStack cloud system and tested its performance for all possible sets of sensor faults. For each resulting fault model, we injected the possible system faults and measured the average fault detection time. The experimental results show that the proposed mechanism can accurately detect and recover from an injected system fault even when sensors are disabled. In addition, the system fault detection time increases with the number of sensor faults until the HA mechanism degrades to a one-system-fault model, which is the worst case and corresponds to system-layer heartbeating.
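
The idea of rebuilding the fault model when sensors fail can be illustrated with a minimal sketch. It is not the implementation evaluated in this study: the layer names, the latency values, and the functions `rebuild_fault_model` and `mean_detection_time` are all hypothetical, and the sketch assumes that any fault whose dedicated sensor is disabled falls back to a slower, catch-all heartbeat detector.

```python
from typing import Dict, Set

# Hypothetical four layers; the names are illustrative assumptions only.
LAYERS = ["power", "os", "network", "service"]

# Hypothetical detection latencies in seconds (not measured values).
SENSOR_LATENCY = {"power": 1.0, "os": 2.0, "network": 3.0, "service": 4.0}
HEARTBEAT_LATENCY = 10.0  # assumed slower catch-all detector


def rebuild_fault_model(failed_sensors: Set[str]) -> Dict[str, str]:
    """Map each layer's fault to the detector that will report it.

    A layer keeps its own sensor while that sensor works; otherwise its
    fault is reassigned to the heartbeat detector.
    """
    return {
        layer: ("heartbeat" if layer in failed_sensors else layer)
        for layer in LAYERS
    }


def mean_detection_time(model: Dict[str, str]) -> float:
    """Average detection latency over all faults in the given model."""
    times = [
        HEARTBEAT_LATENCY if detector == "heartbeat" else SENSOR_LATENCY[detector]
        for detector in model.values()
    ]
    return sum(times) / len(times)


if __name__ == "__main__":
    # Disable sensors one by one and watch the average latency grow.
    failed: Set[str] = set()
    print(sorted(failed), mean_detection_time(rebuild_fault_model(failed)))
    for layer in LAYERS:
        failed.add(layer)
        print(sorted(failed), mean_detection_time(rebuild_fault_model(failed)))
```

Under these assumptions, each additional disabled sensor moves one more fault onto the heartbeat detector, so the average detection time rises until every fault is reported by the heartbeat alone. That final state plays the role of the one-system-fault model in this sketch.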
