Impact of Failure Prediction on Availability: Modeling and Comparative Analysis of Predictive and Reactive Methods

Predicting failures and acting proactively can improve availability: a correct prediction followed by a successful mitigation brings a reward in the form of reduced downtime. Conversely, each incorrect prediction may introduce additional downtime (a penalty). Therefore, depending on the quality of prediction and the system parameters, predictive fault-tolerance methods may improve or degrade availability in comparison with reactive ones. We first derive taxonomies of fault-tolerance techniques and policies to differentiate between reactive and proactive policies; the latter are further classified as systematic and predictive. To evaluate whether a predictive policy improves availability, we derive an analytical model for availability quantification. We use Markov chains to extend the steady-state availability equation to include precision and recall, penalty and reward, mitigation success probability, and a potential failure rate increase due to the prediction load. We also derive an A-measure to optimize failure prediction for increasing availability. We conclude that precision and recall have an impact on availability comparable to that of changing MTTF and MTTR. To validate the model, we also simulate and analyze the availability of a virtualized server with exponentially distributed failure and repair times.
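
To make the trade-off between reward and penalty concrete, the following is a minimal sketch and not the Markov model derived in the paper. It uses the standard steady-state availability MTTF / (MTTF + MTTR) and the usual definitions of precision and recall, then applies a simplified expected-downtime-per-failure approximation. The function names, the parameters `mitigated_downtime` and `false_alarm_downtime`, and the assumption that prediction load does not change the failure rate are all illustrative assumptions introduced here.

```python
def steady_state_availability(mttf, mttr):
    """Standard steady-state availability of a reactive system."""
    return mttf / (mttf + mttr)


def precision_recall(tp, fp, fn):
    """Precision and recall from counts of true positives, false positives, false negatives."""
    return tp / (tp + fp), tp / (tp + fn)


def toy_predictive_availability(mttf, mttr, precision, recall,
                                mitigation_success,
                                mitigated_downtime,
                                false_alarm_downtime):
    """Toy approximation (an assumption, not the paper's model).

    Per failure:
      - a fraction recall * mitigation_success is predicted and mitigated,
        costing mitigated_downtime instead of mttr (the reward);
      - the rest is repaired reactively, costing mttr;
      - each true positive is accompanied on average by
        (1 - precision) / precision false alarms, each costing
        false_alarm_downtime (the penalty).
    The failure rate is assumed unaffected by the prediction load.
    """
    if not 0.0 < precision <= 1.0:
        raise ValueError("precision must be in (0, 1]")
    handled = recall * mitigation_success
    false_alarms_per_failure = recall * (1.0 - precision) / precision
    expected_downtime = (handled * mitigated_downtime
                         + (1.0 - handled) * mttr
                         + false_alarms_per_failure * false_alarm_downtime)
    return mttf / (mttf + expected_downtime)


if __name__ == "__main__":
    # Hypothetical values, all in the same time unit (e.g. hours).
    base = steady_state_availability(1000.0, 10.0)
    good = toy_predictive_availability(1000.0, 10.0, 0.9, 0.8, 0.9, 1.0, 1.0)
    poor = toy_predictive_availability(1000.0, 10.0, 0.05, 0.8, 0.9, 1.0, 1.0)
    print(f"reactive: {base:.5f}  good predictor: {good:.5f}  poor predictor: {poor:.5f}")
```

Under these hypothetical values, the high-precision predictor yields higher availability than the reactive baseline, while the low-precision predictor yields lower availability because false-alarm penalties outweigh the reward. This is the qualitative behavior the abstract describes.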
