Towards Proactive Fault Management of Enterprise Systems

This paper introduces a model-based approach for autonomic fault management of computing systems. The proposed approach recovers a system from common faults while minimizing the impact on the system's quality of service and reducing potential revenue loss. When a fault occurs, the approach identifies the fault type and then uses a predictive control algorithm to compute the optimal recovery action, that is, the action with the least impact on performance and operating cost. The paper presents the formal setting of the model-based fault management approach and the underlying predictive control algorithm. The approach has been verified on a testbed against simulated faults, including memory leaks and network congestion. Simulation results show that the approach enabled effective automatic recovery from these faults with minimal impact on system performance.
