Self-adaptive cloud monitoring with online anomaly detection

Monitoring is key to guaranteeing the reliability of cloud computing systems. By analyzing monitoring data, administrators can understand system status and detect, diagnose, and solve problems. However, because of the enormous scale and complex structure of cloud computing, a monitoring system must collect, transfer, store, and process a large amount of monitoring data, which incurs significant performance overhead and makes it harder to extract useful information. To address this issue, this paper proposes a self-adaptive monitoring approach for cloud computing systems. First, we conduct correlation analysis between different metrics and monitor only selected important ones, which represent the others and reflect the running status of the system. Second, we characterize the running status with Principal Component Analysis (PCA), estimate the anomaly degree, and predict the possibility of faults. Finally, we dynamically adjust the monitoring period based on the estimated anomaly degree and a reliability model. To evaluate our proposal, we applied the approach to Bench4Q, our open-source TPC-W benchmark, deployed on OnceCloud, our real cloud computing platform. The experimental results demonstrate that our approach can adapt to dynamic workloads, accurately estimate the anomaly degree, and automatically adjust monitoring periods. Thus, the approach improves the accuracy and timeliness of anomaly detection when the system is in an abnormal status, and lowers the monitoring overhead when it is in a normal status.

Correlation analysis is proposed to select key metrics representing others. PCA is proposed to characterize running status and predict the possibility of faults. We dynamically adjust metrics and periods based on a reliability model. We evaluate the approach on our real cloud platform with case studies.
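The three steps summarized in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the greedy selection rule, the correlation threshold of 0.9, the use of the angle between first principal components as the anomaly degree, and the linear mapping from anomaly degree to monitoring period (standing in for the reliability model, which the abstract does not specify) are all assumptions made for illustration, as are the function and parameter names.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def select_representatives(metrics, threshold=0.9):
    """Step 1 (assumed greedy rule): keep a metric unless it is strongly
    correlated with a metric that has already been kept."""
    kept = []
    for name, series in metrics.items():
        if all(abs(pearson(series, metrics[k])) < threshold for k in kept):
            kept.append(name)
    return kept

def principal_direction(rows, iters=200):
    """First principal component of rows (samples x metrics), found by
    power iteration on the sample covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Asymmetric start vector; power iteration can still fail if the start
    # happens to be orthogonal to the dominant eigenvector.
    v = [1.0 + j for j in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v

def anomaly_degree(baseline_rows, window_rows):
    """Step 2 (assumed measure): 1 - |cos| of the angle between the baseline
    and current principal directions; 0 means unchanged, 1 means orthogonal."""
    a = principal_direction(baseline_rows)
    b = principal_direction(window_rows)
    return 1.0 - min(1.0, abs(sum(x * y for x, y in zip(a, b))))

def next_period(degree, min_period=1.0, max_period=60.0):
    """Step 3 (assumed linear rule): a higher anomaly degree gives a shorter
    monitoring period, in seconds."""
    degree = min(1.0, max(0.0, degree))
    return max_period - degree * (max_period - min_period)
```

With this sketch, a metric that duplicates another (for example, a hypothetical `load` series that is an exact multiple of `cpu`) is dropped in step 1, and the monitoring period shrinks from `max_period` toward `min_period` as the current window's principal direction drifts away from the baseline.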
