Automatic Problem Localization via Multi-dimensional Metric Profiling

Debugging today's large-scale distributed applications is complex. Traditional techniques such as breakpoint-based debugging and performance profiling demand substantial domain knowledge and do not automate the localization of bugs and performance anomalies. We present Orion, a framework that automates problem localization in distributed applications. From a large set of metrics, Orion intelligently selects the important ones and models the application's runtime behavior through pairwise correlations of those metrics within multiple non-overlapping time windows. When the correlations deviate from those of a learned correct model due to a bug, our analysis pinpoints the metrics and code regions (the class, and the method within it) most likely associated with the failure. We demonstrate the framework on several real-world failures in distributed applications, including HBase, Hadoop DFS, a campus-wide Java application, and a regression testing framework from IBM. Our results show that Orion pinpoints the metrics and code regions that developers should concentrate on to fix the failures.
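
Although the abstract includes no code, the windowed pairwise-correlation idea is mechanical enough to sketch. The fragment below is a minimal illustration, assuming NumPy, a fixed metric sampling rate, and illustrative window sizes and metric indices; it is not Orion's actual implementation. Correlations are computed per non-overlapping window on a known-good run, and metric pairs in a faulty run are ranked by how far their correlations drift from that baseline.

    # Minimal sketch of windowed pairwise-correlation profiling.
    # Assumes metrics are sampled at a fixed rate into a 2-D array
    # (rows = samples, columns = metrics); the window size and metric
    # layout are illustrative, not taken from the Orion paper.
    import numpy as np

    def windowed_correlations(samples: np.ndarray, window: int) -> np.ndarray:
        """Pearson correlation matrix per non-overlapping time window.

        Returns an array of shape (n_windows, n_metrics, n_metrics).
        """
        n_windows = samples.shape[0] // window
        mats = []
        for w in range(n_windows):
            chunk = samples[w * window:(w + 1) * window]
            mats.append(np.corrcoef(chunk, rowvar=False))
        return np.array(mats)

    def rank_suspect_pairs(baseline: np.ndarray, faulty: np.ndarray, window: int):
        """Rank metric pairs by how far the faulty run's windowed
        correlations drift from the baseline (correct) model."""
        base = np.nanmean(windowed_correlations(baseline, window), axis=0)
        bad = np.nanmean(windowed_correlations(faulty, window), axis=0)
        drift = np.abs(base - bad)
        n = drift.shape[0]
        pairs = [(drift[i, j], i, j) for i in range(n) for j in range(i + 1, n)]
        return sorted(pairs, reverse=True)  # largest deviation first

    # Example: 3 metrics, 1000 samples, windows of 100 samples.
    rng = np.random.default_rng(0)
    good = rng.normal(size=(1000, 3))
    good[:, 1] = good[:, 0] + 0.1 * rng.normal(size=1000)  # metrics 0 and 1 correlate
    bad = rng.normal(size=(1000, 3))                        # that correlation is broken
    for score, i, j in rank_suspect_pairs(good, bad, window=100)[:3]:
        print(f"metrics ({i},{j}) deviation={score:.2f}")

In Orion itself, the highest-ranked metrics are further mapped to code regions (class and method), a step this sketch omits.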
