Reference-driven performance anomaly identification

Complex system software runs under a wide variety of execution conditions, spanning system configurations and workload properties. This paper explores a principled use of reference executions (executions whose conditions are similar to the target's) to help identify the symptoms and causes of performance anomalies. First, to identify anomaly symptoms, we construct change profiles that probabilistically characterize the expected performance deviations between target and reference executions. By synthesizing several single-parameter change profiles, we can scalably identify anomalous reference-to-target changes in a complex system with multiple execution parameters. Second, to narrow the scope of anomaly root cause analysis, we filter anomaly-related low-level system metrics as those that manifest very differently between target and reference executions. Our anomaly identification approach requires little expert knowledge and no detailed models of system internals, so it can be easily deployed. Using empirical case studies on the Linux I/O subsystem and a J2EE-based distributed online service, we demonstrate the approach's effectiveness in identifying performance anomalies over a wide range of execution conditions as well as across multiple system software versions. In particular, we discovered five previously unknown performance anomaly causes in the Linux 2.6.23 kernel. Additionally, our preliminary results suggest that online anomaly detection and system reconfiguration may help evade performance anomalies in complex online systems.
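
To make the abstract's two steps concrete, below is a minimal Python sketch of the general idea, written under stated assumptions rather than as the paper's implementation. All function names (change_profile, tail_prob, compound_profile, anomalous, suspicious_metrics) are hypothetical; the Monte Carlo composition of single-parameter profiles and the normalized mean difference used to rank metrics are illustrative stand-ins for the paper's actual probabilistic synthesis and metric comparison.

```python
import numpy as np

# Step 1: change profiles for identifying anomaly symptoms.

def change_profile(ref_perf, tgt_perf):
    """Empirical change profile for one execution-condition parameter:
    the distribution of target/reference performance ratios observed
    over many known-good execution pairs."""
    return np.sort(np.asarray(tgt_perf, dtype=float) /
                   np.asarray(ref_perf, dtype=float))

def tail_prob(profile, ratio):
    """Fraction of the profile at or below the observed ratio; a very
    small value means the target degraded more than the profile deems
    plausible."""
    return np.searchsorted(profile, ratio, side="right") / len(profile)

def compound_profile(profiles, n=10000, seed=0):
    """Synthesize several single-parameter change profiles into one for
    a multi-parameter change by multiplying independently sampled
    per-parameter ratios (a Monte Carlo composition; an assumption,
    not necessarily the paper's exact synthesis rule)."""
    rng = np.random.default_rng(seed)
    samples = np.ones(n)
    for p in profiles:
        samples *= rng.choice(p, size=n)
    return np.sort(samples)

def anomalous(profiles, observed_ratio, alpha=0.01):
    """Flag a reference-to-target change whose observed performance
    ratio falls in the extreme lower tail of the synthesized profile."""
    return tail_prob(compound_profile(profiles), observed_ratio) < alpha

# Step 2: filter low-level metrics for root cause analysis.

def suspicious_metrics(ref_metrics, tgt_metrics, top_k=5):
    """Rank low-level system metrics (e.g., per-request block I/O
    counts) by how differently they manifest between reference and
    target runs, using a simple normalized mean difference as a
    stand-in distance measure, and keep the top candidates."""
    scores = {}
    for name, ref_vals in ref_metrics.items():
        r = np.asarray(ref_vals, dtype=float)
        t = np.asarray(tgt_metrics[name], dtype=float)
        denom = (abs(r.mean()) + abs(t.mean())) or 1.0
        scores[name] = abs(t.mean() - r.mean()) / denom
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

In this rendering, ratios are of a "higher is better" measure such as throughput, so anomalies surface in the lower tail; a real deployment would build one profile per configurable parameter and compare full metric distributions rather than means.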
