Automatic Root-cause Diagnosis of Performance Anomalies in Production Software

Troubleshooting the performance of complex production software is challenging. Most existing tools, such as profiling, tracing, and logging systems, reveal what events occurred during performance anomalies. However, the users of such tools must then infer why these events occurred during a particular execution; e.g., that their execution was due to a specific input request or configuration setting. Because manual root cause determination is time-consuming and difficult, this paper introduces performance summarization, a technique for automatically inferring the root cause of performance problems. Performance summarization first attributes performance costs to fine-grained events such as individual instructions and system calls. It then uses dynamic information flow to determine the probable root causes for the execution of each event. The cost of each event is assigned to root causes according to the relative probability that the causes led to the execution of that event. Finally, the total cost for each root cause is calculated by summing the per-cause costs of all events. This paper also describes a differential form of performance summarization that compares two activities. We have implemented a tool called X-ray that performs performance summarization. Our experimental results show that X-ray accurately diagnoses 14 performance issues in the Apache HTTP server, Postfix mail server and PostgreSQL database, while adding only 1–7% overhead to production systems.
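The cost-attribution step described above can be illustrated with a minimal sketch. This is not X-ray's implementation: the function name `summarize`, the event representation (a cost paired with per-cause weights), and the cause labels are all assumptions made for illustration. In the actual system the weights would come from dynamic information flow analysis rather than being supplied by hand.

```python
from collections import defaultdict


def summarize(events):
    """Sum per-cause costs over all events.

    `events` is an iterable of (cost, weights) pairs, where `weights`
    maps a hypothetical root-cause label to the relative likelihood
    that this cause led to the event's execution.
    """
    totals = defaultdict(float)
    for cost, weights in events:
        norm = sum(weights.values())
        if norm == 0:
            continue  # no known cause for this event; its cost is left unattributed
        for cause, weight in weights.items():
            # Each cause receives a share of the event's cost
            # proportional to its relative probability.
            totals[cause] += cost * weight / norm
    return dict(totals)


# Hypothetical events: one influenced by both a configuration setting
# and a request field, one influenced only by the request field.
events = [
    (4.0, {"config:ExampleSetting": 3.0, "request:field": 1.0}),
    (2.0, {"request:field": 1.0}),
]
totals = summarize(events)
for cause, cost in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(cause, cost)
```

Ranking the causes by total cost gives the kind of ordered list of suspects the abstract describes; the normalization keeps every event's cost fully distributed among its candidate causes.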
