CauseInfer: Automatic and distributed performance diagnosis with hierarchical causality graph in large distributed systems

Modern applications, especially cloud-based or cloud-centric ones, comprise many components with complex interactions running in large distributed environments. They are prone to performance and availability problems caused by the highly dynamic runtime environment, such as resource hogs, configuration changes, and software bugs. To make software maintenance more efficient and provide hints about the location of faults, we build CauseInfer, a low-cost, black-box cause inference system that requires no instrumentation of application source code. CauseInfer automatically constructs a two-layer hierarchical causality graph and infers the causes of performance problems along the causal paths in the graph using a series of statistical methods. In an experimental evaluation in a controlled environment, CauseInfer identifies root causes with an average 80% precision and 85% recall within a list of the top two candidate causes, outperforming several state-of-the-art methods, and it scales well in distributed systems.
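The inference step described above, walking causal paths in the causality graph to locate root causes, can be illustrated with a minimal sketch. The graph edges, metric names, anomaly scores, and threshold below are all hypothetical assumptions for illustration; the paper builds the actual graph and anomaly decisions with statistical methods (e.g., dependency and change-point tests), not the hard-coded values used here.

```python
# Hedged sketch: infer candidate root causes by walking a causality graph
# backwards (effect -> cause) from an anomalous service-level metric.
# All graph edges and anomaly scores here are illustrative assumptions.
from collections import defaultdict

def infer_causes(edges, anomaly_score, slo_metric, threshold=0.5):
    """Traverse cause->effect edges in reverse from the anomalous SLO
    metric; metrics that are anomalous but have no anomalous causes
    upstream are reported as candidate root causes, strongest first."""
    parents = defaultdict(list)
    for cause, effect in edges:
        parents[effect].append(cause)

    candidates, stack, seen = [], [slo_metric], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        anomalous_parents = [p for p in parents[node]
                             if anomaly_score.get(p, 0.0) > threshold]
        if anomalous_parents:
            # keep walking toward the origin of the anomaly
            stack.extend(anomalous_parents)
        elif node != slo_metric:
            # no anomalous cause upstream: candidate root cause
            candidates.append(node)
    # rank candidates by anomaly score, strongest first
    return sorted(candidates, key=lambda n: -anomaly_score.get(n, 0.0))

# Hypothetical metrics: a CPU hog and slow disk I/O both affect latency.
edges = [("cpu_hog", "app_latency"), ("db_latency", "app_latency"),
         ("disk_io", "db_latency")]
scores = {"cpu_hog": 0.9, "db_latency": 0.7,
          "disk_io": 0.8, "app_latency": 0.95}
print(infer_causes(edges, scores, "app_latency"))  # -> ['cpu_hog', 'disk_io']
```

Ranking the candidates by anomaly score mirrors the paper's evaluation setting, where the true root cause should appear in a short list (top two) of reported causes.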
