The design of a fault management framework for cloud

High-performance computing systems can have high failure rates because they combine a large number of servers and components with intensive workloads. System availability is easily compromised if failures of these subsystems are not handled correctly. This research proposes a proactive fault tolerance framework for enterprise cloud computing systems. The main idea is to build an effective prediction model focused on hardware failure. The proposed framework has two major components: monitoring and availability analysis. For each machine, the availability analysis module tracks historical states and predicts the machine's future state. Depending on the predicted state, the resource manager decides whether tasks on that machine should be migrated to prevent possible losses. By relying on task migration, the framework avoids the cost of job replication and backup. The framework also incorporates an adequacy-checking function into the availability analysis so that the prediction model is periodically evaluated and adjusted. The framework can therefore be adopted by heterogeneous datacenters. Energy efficiency can also improve as the impact of failures on the datacenter is reduced.
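
The flow described above (track historical states, predict the next state, decide on migration, periodically check the model's adequacy) can be illustrated with a minimal sketch. The abstract does not specify the prediction model, so everything here is an assumption made for illustration: the state labels, the first-order transition-count predictor, the `AvailabilityAnalyzer` and `needs_migration` names, and the 0.7 accuracy threshold are all hypothetical and are not the authors' method.

```python
from collections import Counter, defaultdict

# Hypothetical state labels; the abstract does not name the states it tracks.
STATES = ("healthy", "degraded", "failed")


class AvailabilityAnalyzer:
    """Tracks one machine's state history and predicts its next state.

    Assumption: a first-order transition-count model stands in for the
    paper's (unspecified) prediction model.
    """

    def __init__(self):
        self.history = []                        # observed states, oldest first
        self.transitions = defaultdict(Counter)  # transitions[prev][next] = count
        self.hits = 0                            # predictions that matched reality
        self.predictions = 0                     # predictions scored so far

    def predict(self):
        """Return the most frequently observed successor of the current state."""
        if not self.history:
            return "healthy"                     # no data yet: assume healthy
        current = self.history[-1]
        successors = self.transitions[current]
        if not successors:
            return current                       # no evidence: assume state persists
        return successors.most_common(1)[0][0]

    def observe(self, state):
        """Record a monitored state and score the prediction made for it."""
        if self.history:
            self.predictions += 1
            if self.predict() == state:
                self.hits += 1
            self.transitions[self.history[-1]][state] += 1
        self.history.append(state)

    def is_adequate(self, min_accuracy=0.7):
        """Adequacy check: is past prediction accuracy above the threshold?"""
        if self.predictions == 0:
            return True                          # nothing to judge yet
        return self.hits / self.predictions >= min_accuracy


def needs_migration(analyzer):
    """Resource-manager decision: migrate if a non-healthy state is predicted."""
    return analyzer.predict() in ("degraded", "failed")
```

In this sketch, a resource manager would feed each monitoring sample to `observe`, call `needs_migration` to decide whether to move tasks off the machine, and treat a `False` result from `is_adequate` as the signal to adjust or retrain the model.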
