Metric selection and anomaly detection for cloud operations using log and metric correlation analysis

Abstract: Cloud computing systems provide the facilities to make application services resilient against failures of individual computing resources. However, resiliency is typically limited by a cloud consumer’s use and operation of cloud resources. In particular, system operations have been reported as one of the leading causes of system-wide outages. This applies specifically to DevOps operations, such as backup, redeployment, upgrade, customized scaling, and migration, which are executed at much higher frequencies now than a decade ago. We address this problem by proposing a novel approach to detect errors in the execution of these kinds of operations, in particular rolling upgrade operations. Our regression-based approach leverages the correlation between operations’ activity logs and the effect of operation activities on cloud resources. First, we present a metric selection approach based on regression analysis. Second, the output of a regression model of the selected metrics is used to derive assertion specifications, which can be used for runtime verification of running operations. We have conducted a set of experiments with different configurations of an upgrade operation on Amazon Web Services, with and without randomly injected faults, to demonstrate the utility of our new approach.
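The two steps named in the abstract (regression-based metric selection, then assertions derived from the fitted model) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the regression form, the selection criterion, or the assertion format. The sketch assumes a simple one-predictor least-squares fit per metric, an R² threshold for selection, and a tolerance band of k residual standard deviations as the assertion. The metric names, the sample values, and the parameters `min_r2` and `k` are all hypothetical.

```python
from math import sqrt


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x.

    Returns (a, b, r2, resid_std), where resid_std is the residual
    standard deviation with n - 2 degrees of freedom.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    return a, b, r2, sqrt(ss_res / (n - 2))


def select_metrics(log_counts, metrics, min_r2=0.8):
    """Keep only metrics that the log activity counts explain well.

    min_r2 is an assumed threshold, not a value from the paper.
    """
    models = {}
    for name, values in metrics.items():
        a, b, r2, s = fit_line(log_counts, values)
        if r2 >= min_r2:
            models[name] = (a, b, s)
    return models


def check_assertion(model, log_count, observed, k=3.0):
    """Assertion: the observed metric lies within k residual standard
    deviations of the value the regression predicts from the logs."""
    a, b, s = model
    expected = a + b * log_count
    return abs(observed - expected) <= k * s


# Hypothetical training data from fault-free runs: per time window, the
# cumulative number of "instance terminated" log events and two metrics.
log_counts = [0, 1, 2, 3, 4, 5, 6, 7]
metrics = {
    "terminated_instances": [0, 1, 2, 3, 5, 5, 6, 7],
    "cpu_utilization": [40, 55, 38, 60, 42, 58, 41, 57],
}

models = select_metrics(log_counts, metrics)
ok_run = check_assertion(models["terminated_instances"], 4, 4)
faulty_run = check_assertion(models["terminated_instances"], 4, 0)
```

In this toy data only `terminated_instances` tracks the log activity closely enough to be selected. At runtime, a window in which the logs report four terminations but the metric shows none violates the derived assertion, which is the kind of mismatch between logged activity and observed resource state that the approach is meant to flag.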
