Framework for Enabling Highly Available Distributed Applications for Utility Computing

The move towards IT outsourcing is the first step towards an environment where compute infrastructure is treated as a service. In utility computing, this IT service has to honor Service Level Agreements (SLAs) in order to meet the desired Quality of Service (QoS) guarantees. Such an environment requires reliable services in order to maximize resource utilization and decrease the Total Cost of Ownership (TCO). This reliability cannot come at the cost of resource duplication, since duplication increases the TCO of the data center and hence the cost per compute unit. In this paper, we examine how to project the impact of hardware failures on SLAs and the techniques required to take proactive recovery steps when a failure is predicted. By maintaining health vectors of all hardware and system resources, we predict the failure probability of resources at runtime, based on observed hardware error and failure events. This in turn prompts an availability-aware middleware to take proactive action, even before the application is affected when the system and the application have low recoverability. The proposed framework has been prototyped on a system running HP-UX. Our offline analysis of the prediction system on hardware error logs indicates no more than 10% false positives. To the best of our knowledge, this work is the first of its kind to perform an end-to-end analysis of the impact of a hardware fault on application SLAs in a live system.
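As a rough illustration of the idea of health vectors feeding a runtime failure prediction, the following minimal sketch keeps a decayed error score per resource and maps it to a failure probability. It is not the paper's method: the abstract does not specify the prediction model, so the class `HealthVector`, the function `resources_to_evacuate`, the exponential decay, the `sensitivity` parameter, and the 0.8 threshold are all hypothetical choices made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class HealthVector:
    """Hypothetical per-resource health record (not from the paper).

    Keeps a single score that grows with each observed error event and
    decays exponentially over time, so old errors matter less.
    """
    half_life_s: float = 3600.0   # assumed decay half-life, in seconds
    score: float = 0.0            # decayed, severity-weighted error count
    last_update_s: float = 0.0    # timestamp of the last score update

    def _decay_to(self, now_s: float) -> None:
        # Halve the score for every half-life elapsed since the last update.
        elapsed = max(0.0, now_s - self.last_update_s)
        self.score *= 0.5 ** (elapsed / self.half_life_s)
        self.last_update_s = max(now_s, self.last_update_s)

    def record_error(self, now_s: float, severity: float = 1.0) -> None:
        # Fold one observed hardware error event into the score.
        self._decay_to(now_s)
        self.score += severity

    def failure_probability(self, now_s: float, sensitivity: float = 0.5) -> float:
        # Assumed mapping from score to probability: 1 - exp(-k * score).
        # It is 0 with no errors and approaches 1 as errors accumulate.
        self._decay_to(now_s)
        return 1.0 - math.exp(-sensitivity * self.score)


def resources_to_evacuate(health: dict, now_s: float, threshold: float = 0.8) -> list:
    """Return names of resources whose predicted failure probability meets
    the (assumed) threshold, i.e. candidates for proactive action."""
    return sorted(
        name for name, hv in health.items()
        if hv.failure_probability(now_s) >= threshold
    )


# Usage: a DIMM reporting repeated errors is flagged; a CPU with one error is not.
health = {"cpu0": HealthVector(), "dimm3": HealthVector()}
health["cpu0"].record_error(0)
for t in (0, 60, 120, 180):
    health["dimm3"].record_error(t)
flagged = resources_to_evacuate(health, now_s=200)
print(flagged)
```

In a setup like the one the abstract describes, a list such as `flagged` would be the signal handed to the availability-aware middleware. What that middleware then does (migration, checkpointing, or something else) is outside the scope of this sketch.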
