Understanding Vicious Cycles in Server Clusters

In this paper, we present an automated on-line service for troubleshooting performance problems in server clusters caused by unintended vicious cycles. The tool complements a large body of prior performance troubleshooting and diagnostic literature for server farms, which identifies problems arising from resource bottlenecks or failed components. We show that unintended interactions between components in large-scale systems can cause performance problems even in the absence of bottlenecks or failures. Our tool uses discriminative sequence mining to identify anomalous sequences of events that are candidate causes of the performance problem. It looks for patterns consistent with "vicious cycles" or unstable behavior, since such patterns, when present, are the most likely to be problematic. It highlights candidates that are semantically conflicting, such as those arising when different performance management mechanisms make adjustments in opposing directions. Our approach offers two key advantages in performance troubleshooting. First, it does not require detailed prior knowledge of the underlying system to diagnose the problem. Second, unlike simple statistical techniques such as correlation analysis, which work well for continuous variables, our scheme can also identify chains of events (labels) that may explain the root cause of a problem. Our service is deployed on a web server testbed of 17 machines. To make the comparison with prior work more concrete, we first reproduce two real-life problem scenarios reported in earlier literature, then explore a third, new case study. In all cases, our tool reports the patterns that explain the cause of the problem without requiring detailed a priori knowledge.
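To give a rough sense of what discriminative sequence mining over event logs can look like, the following is a minimal sketch and not the tool described above. It assumes that event logs have already been split into "bad" (degraded performance) and "good" windows, it simplifies patterns to contiguous event subsequences, and it ranks each pattern by the difference between its support in the two sets. The event labels, the function names, and the `min_gap` threshold are all hypothetical.

```python
from collections import Counter


def ngrams(events, n):
    """Return all contiguous subsequences of length n from one event log."""
    return [tuple(events[i:i + n]) for i in range(len(events) - n + 1)]


def support(logs, max_len):
    """Map each pattern (length 1..max_len) to the fraction of logs containing it."""
    counts = Counter()
    for log in logs:
        seen = set()  # count each pattern at most once per log
        for n in range(1, max_len + 1):
            seen.update(ngrams(log, n))
        counts.update(seen)
    return {pattern: c / len(logs) for pattern, c in counts.items()}


def discriminative_patterns(bad_logs, good_logs, max_len=3, min_gap=0.5):
    """Rank patterns that are frequent in bad logs but rare in good logs."""
    bad = support(bad_logs, max_len)
    good = support(good_logs, max_len)
    scored = [(p, s - good.get(p, 0.0)) for p, s in bad.items()]
    ranked = [(p, gap) for p, gap in scored if gap >= min_gap]
    # Largest support gap first, then longer patterns, then lexicographic order.
    ranked.sort(key=lambda item: (-item[1], -len(item[0]), item[0]))
    return ranked


# Hypothetical event labels: two controllers adjusting in conflicting directions.
bad_logs = [
    ["load_up", "freq_down", "add_server", "freq_up", "remove_server"],
    ["freq_down", "add_server", "freq_up", "remove_server", "freq_down"],
]
good_logs = [
    ["load_up", "freq_up", "load_down", "freq_down"],
    ["load_up", "add_server", "load_down", "remove_server"],
]

ranked = discriminative_patterns(bad_logs, good_logs)
for pattern, gap in ranked:
    print(" -> ".join(pattern), f"(support gap {gap:.2f})")
```

In this toy data, chains that alternate frequency changes with server additions and removals appear only in the bad windows, so they rise to the top of the ranking. A real deployment would still need a way to label windows as good or bad and to judge which highlighted chains are semantically conflicting.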
