Towards systems level prognostics in the Cloud

Many application systems are moving from device-centric architectures to cloud-based systems that leverage shared compute resources to reduce cost and maximize reach. These systems require new paradigms to assure availability and quality of service. In this paper, we discuss the challenges of assuring availability and quality of service in a cloud-based application system. We propose machine learning techniques that monitor system logs to assess the health of the system. A web services data set is used to show that a variety of services can be clustered into different service classes using a k-means clustering scheme. Reliability, Availability, and Serviceability (RAS) logs and job logs from a high-performance computing system are used to show that impending fatal errors in the system can be predicted from the logs using an SVM classifier. These approaches illustrate the feasibility of monitoring the health and performance of compute resources, and hence can be used to manage these systems for high availability and quality of service in critical tasks such as health care monitoring in the cloud.
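As a rough illustration of the clustering step, the sketch below groups services into two service classes with plain k-means (Lloyd's algorithm) using only the Python standard library. The feature choice (response time and availability), the sample values, and the helper names `min_max_scale` and `kmeans` are assumptions made for this example; they are not the paper's data set or implementation. In practice a library implementation such as scikit-learn's `KMeans` would likely be used instead.

```python
import math
import random

# Hypothetical QoS records: (response time in ms, availability in %).
# These values are invented for illustration, not taken from the paper's data set.
services = [
    (100.0, 99.0), (120.0, 98.0), (90.0, 99.5), (110.0, 97.0),    # fast, highly available
    (900.0, 80.0), (950.0, 78.0), (870.0, 85.0), (1000.0, 75.0),  # slow, less available
]


def min_max_scale(points):
    """Rescale each feature to [0, 1] so no single unit dominates the distance."""
    columns = list(zip(*points))
    lows = [min(col) for col in columns]
    spans = [(max(col) - min(col)) or 1.0 for col in columns]
    return [
        tuple((v - lo) / span for v, lo, span in zip(p, lows, spans))
        for p in points
    ]


def kmeans(points, k, max_iters=100, seed=0):
    """Lloyd's algorithm: returns (centroids, labels) for the given points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the input points
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable, so the centroids are too
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels


scaled = min_max_scale(services)
centroids, labels = kmeans(scaled, k=2)
for record, label in zip(services, labels):
    print(record, "-> service class", label)
```

Scaling the features first is a design choice for this sketch: without it, response time in milliseconds would dominate the Euclidean distance and availability would barely influence the grouping.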
