Automatic Problem Localization via Multi-dimensional Metric Profiling

Debugging today's large-scale distributed applications is complex. Traditional techniques such as breakpoint-based debugging and performance profiling demand substantial domain knowledge and do not automate the localization of bugs and performance anomalies. We present Orion, a framework that automates problem localization in distributed applications. From a large set of metrics, Orion intelligently selects the important ones and models the application's runtime behavior through pairwise correlations of those metrics within multiple non-overlapping time windows. When the correlations deviate from those of a learned correct model due to a bug, our analysis pinpoints the metrics and code regions (the class, and the method within it) most likely associated with the failure. We demonstrate the framework on several real-world failures in distributed applications, including HBase, Hadoop DFS, a campus-wide Java application, and a regression testing framework from IBM. Our results show that Orion pinpoints the metrics and code regions that developers should concentrate on to fix the failures.
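
Although the abstract includes no code, the windowed pairwise-correlation idea is mechanical enough to sketch. The fragment below is a minimal illustration, assuming NumPy, a fixed metric sampling rate, and illustrative window sizes and metric indices; it is not Orion's actual implementation. Correlations are computed per non-overlapping window on a known-good run, and metric pairs in a faulty run are ranked by how far their correlations drift from that baseline.

    # Minimal sketch of windowed pairwise-correlation profiling.
    # Assumes metrics are sampled at a fixed rate into a 2-D array
    # (rows = samples, columns = metrics); the window size and metric
    # layout are illustrative, not taken from the Orion paper.
    import numpy as np

    def windowed_correlations(samples: np.ndarray, window: int) -> np.ndarray:
        """Pearson correlation matrix per non-overlapping time window.

        Returns an array of shape (n_windows, n_metrics, n_metrics).
        """
        n_windows = samples.shape[0] // window
        mats = []
        for w in range(n_windows):
            chunk = samples[w * window:(w + 1) * window]
            mats.append(np.corrcoef(chunk, rowvar=False))
        return np.array(mats)

    def rank_suspect_pairs(baseline: np.ndarray, faulty: np.ndarray, window: int):
        """Rank metric pairs by how far the faulty run's windowed
        correlations drift from the baseline (correct) model."""
        base = np.nanmean(windowed_correlations(baseline, window), axis=0)
        bad = np.nanmean(windowed_correlations(faulty, window), axis=0)
        drift = np.abs(base - bad)
        n = drift.shape[0]
        pairs = [(drift[i, j], i, j) for i in range(n) for j in range(i + 1, n)]
        return sorted(pairs, reverse=True)  # largest deviation first

    # Example: 3 metrics, 1000 samples, windows of 100 samples.
    rng = np.random.default_rng(0)
    good = rng.normal(size=(1000, 3))
    good[:, 1] = good[:, 0] + 0.1 * rng.normal(size=1000)  # metrics 0 and 1 correlate
    bad = rng.normal(size=(1000, 3))                        # that correlation is broken
    for score, i, j in rank_suspect_pairs(good, bad, window=100)[:3]:
        print(f"metrics ({i},{j}) deviation={score:.2f}")

In Orion itself, the highest-ranked metrics are further mapped to code regions (class and method), a step this sketch omits.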
