Paper information - Automatic Root-cause Diagnosis of Performance Anomalies in Production Software


Troubleshooting the performance of complex production software is challenging. Most existing tools, such as profiling, tracing, and logging systems, reveal what events occurred during performance anomalies. However, the users of such tools must then infer why these events occurred during a particular execution; e.g., that their execution was due to a specific input request or configuration setting. Because manual root cause determination is time-consuming and difficult, this paper introduces performance summarization, a technique for automatically inferring the root cause of performance problems. Performance summarization first attributes performance costs to fine-grained events such as individual instructions and system calls. It then uses dynamic information flow to determine the probable root causes for the execution of each event. The cost of each event is assigned to root causes according to the relative probability that the causes led to the execution of that event. Finally, the total cost for each root cause is calculated by summing the per-cause costs of all events. This paper also describes a differential form of performance summarization that compares two activities. We have implemented a tool called X-ray that performs performance summarization. Our experimental results show that X-ray accurately diagnoses 14 performance issues in the Apache HTTP server, Postfix mail server, and PostgreSQL database, while adding only 1–7% overhead to production systems.
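
The cost-attribution step described in the abstract can be illustrated with a small sketch. This is not X-ray's implementation: the function names `summarize` and `differential`, the event representation (a cost paired with per-cause weights), the cause labels, and the choice to express the differential form as a simple per-cause subtraction are all assumptions made for illustration. In the real system the weights would come from dynamic information flow analysis; here they are supplied by hand.

```python
from collections import defaultdict


def summarize(events):
    """Sum per-cause costs over all events.

    `events` is an iterable of (cost, weights) pairs, where `weights`
    maps a hypothetical root-cause label to its relative weight for
    that event. Weights are normalized per event, so each event's cost
    is split among its causes in proportion to their relative weight.
    """
    totals = defaultdict(float)
    for cost, weights in events:
        norm = sum(weights.values())
        if norm == 0:
            # Assumption of this sketch: an event with no known cause
            # is left unattributed.
            continue
        for cause, weight in weights.items():
            totals[cause] += cost * weight / norm
    return dict(totals)


def differential(summary_a, summary_b):
    """Per-cause cost difference between two activities.

    Assumption: the comparison is a plain subtraction of the two
    summaries; a cause missing from one summary counts as zero there.
    """
    causes = set(summary_a) | set(summary_b)
    return {c: summary_a.get(c, 0.0) - summary_b.get(c, 0.0) for c in causes}


# Hypothetical events: (cost, {cause: relative weight}).
slow_request = [
    (10.0, {"config:Timeout": 3.0, "request:GET": 1.0}),
    (4.0, {"request:GET": 1.0}),
]
fast_request = [
    (2.0, {"request:GET": 1.0}),
]

slow_summary = summarize(slow_request)
fast_summary = summarize(fast_request)
delta = differential(slow_summary, fast_summary)
```

In this example the first event's cost of 10.0 is split 3:1 between the two causes, so `slow_summary` attributes 7.5 to `config:Timeout` and 6.5 to `request:GET`. The differential then shows `config:Timeout` accounting for 7.5 of the extra cost of the slow request and `request:GET` for 4.5.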

J. Flinn | Michael Chow | Mona Attariyan
