CloudRanger: Root Cause Identification for Cloud Native Systems

As more systems migrate to cloud environments, cloud native systems are becoming a trend. This paper presents the challenges and implications of diagnosing root causes in cloud native systems, based on an analysis of real incidents that occurred in IBM Bluemix (a large commercial cloud). To tackle these challenges, we propose CloudRanger, a novel system dedicated to cloud native systems. To make the system more general, we propose a dynamic causal relationship analysis approach that constructs impact graphs among applications without requiring a given topology. A heuristic investigation algorithm based on a second-order random walk then identifies the culprit services responsible for cloud incidents. Experimental results in both a simulation environment and the IBM Bluemix platform show that CloudRanger outperforms some state-of-the-art approaches with a 10% improvement in accuracy. It quickly identifies culprit services when an anomaly occurs. Moreover, the system can be deployed rapidly and easily in multiple kinds of cloud native systems without any predefined knowledge.
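To give a rough feel for what a second-order random walk over an impact graph might look like, here is a minimal Python sketch. It is not CloudRanger's actual algorithm: the graph, the edge weights, the `backtrack_weight` parameter, and the function name `second_order_walk` are all hypothetical choices made for illustration. The only property it demonstrates is that the next step depends on both the current node and the previous one, and that visit counts can be used to rank candidate services.

```python
import random
from collections import Counter


def second_order_walk(graph, start, steps, backtrack_weight=0.2, seed=0):
    """Count node visits of a walk whose next step depends on (prev, cur).

    graph: dict mapping node -> {neighbor: positive edge weight}.
    backtrack_weight: hypothetical factor (< 1) that discourages
    stepping straight back to the node the walker just came from.
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    visits = Counter()
    prev, cur = None, start
    for _ in range(steps):
        neighbors = graph.get(cur, {})
        if not neighbors:
            # Dead end: restart from the node where the anomaly was seen.
            prev, cur = None, start
            continue
        nodes = list(neighbors)
        # Second-order part: the weight of each candidate depends on prev.
        weights = [
            neighbors[n] * (backtrack_weight if n == prev else 1.0)
            for n in nodes
        ]
        nxt = rng.choices(nodes, weights=weights, k=1)[0]
        visits[nxt] += 1
        prev, cur = cur, nxt
    return visits


# Hypothetical impact graph: edge weights stand in for how strongly a
# service's metric is correlated with the anomalous front-end metric.
impact_graph = {
    "frontend": {"auth": 0.1, "db": 0.9},
    "auth": {"frontend": 1.0},
    "db": {"frontend": 1.0},
}

visits = second_order_walk(impact_graph, start="frontend", steps=2000)
# Rank candidate culprits by visit count, excluding the entry point.
ranking = sorted(
    (n for n in visits if n != "frontend"),
    key=lambda n: visits[n],
    reverse=True,
)
print(ranking)
```

In this toy setup the walker is drawn toward the service whose edge weight is highest, so that service accumulates more visits and ranks first. A real system would derive the graph and the weights from monitored metrics rather than hard-coding them.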
