EntomoModel: Understanding and Avoiding Performance Anomaly Manifestations

Subtle implementation errors or misconfigurations in complex Internet services may lead to performance degradations without causing failures. These undiscovered performance anomalies afflict many of today's systems, causing violations of service-level agreements (SLAs), unnecessary resource over-provisioning, or both. In this paper, we re-inserted realistic anomaly causes into a multi-tier Internet service architecture and studied their manifestations. We observed that, for each cause, certain workload and management parameters were more likely to trigger manifestations, suggesting that such parameters could serve as effective classifiers. This observation held even when anomaly causes manifested differently in combination than in isolation. Our study motivates EntomoModel, a framework for depicting performance anomaly manifestations. EntomoModel uses decision tree classification and a design-driven performance model to characterize the workload and management policy settings under which manifestations are likely. EntomoModel enables online system management that avoids anomaly manifestations by dynamically adjusting system management parameters. Our trace-driven evaluations show that manifestation avoidance based on EntomoModel, or entomophobic management, can reduce 98th percentile SLA violations by 67% compared to an anomaly-oblivious adaptive approach. In a cloud computing scenario with elastic resource allocation, our approach uses less than half of the resources needed by static over-provisioning.
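As a rough illustration of how workload and management parameters could act as classifiers for manifestations, the sketch below picks the single threshold split with the highest information gain, the criterion used when inducing decision trees. It is a minimal sketch under stated assumptions, not EntomoModel's actual model: the parameter names (`request_rate`, `cache_mb`), the sample values, and the labels are all hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def best_split(samples, labels):
    """Return (feature, threshold, gain) for the best single binary split."""
    base = entropy(labels)
    best = (None, None, 0.0)
    for feature in samples[0]:
        values = sorted({s[feature] for s in samples})
        # Candidate thresholds are midpoints between consecutive distinct values,
        # so both sides of every split are non-empty.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [y for s, y in zip(samples, labels) if s[feature] <= threshold]
            right = [y for s, y in zip(samples, labels) if s[feature] > threshold]
            remainder = (len(left) * entropy(left)
                         + len(right) * entropy(right)) / len(labels)
            gain = base - remainder
            if gain > best[2]:
                best = (feature, threshold, gain)
    return best

# Hypothetical observations: one workload parameter, one management parameter.
samples = [
    {"request_rate": 100, "cache_mb": 64},
    {"request_rate": 150, "cache_mb": 512},
    {"request_rate": 400, "cache_mb": 64},
    {"request_rate": 450, "cache_mb": 512},
    {"request_rate": 500, "cache_mb": 64},
    {"request_rate": 120, "cache_mb": 512},
]
# Hypothetical labels: True if the anomaly manifested under those settings.
labels = [False, False, True, True, True, False]

feature, threshold, gain = best_split(samples, labels)
print(feature, threshold, round(gain, 3))
```

In this made-up data the request rate separates the manifesting runs from the rest, while the cache size carries almost no information. A management policy built on such a split would try to keep the system on the non-manifesting side of the learned threshold.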
