Efficient Control of False Negative and False Positive Errors with Separate Adaptive Thresholds

Component-level performance thresholds are widely used as a basic means of performance management. As managed applications grow more complex, maintaining these thresholds manually becomes difficult. The complexity stems from the large number of application components and their operational metrics, from dynamically changing workloads, and from compound relationships between components. To alleviate this problem, we advocate that component-level thresholds be computed, managed, and optimized automatically and autonomously. To this end, we have designed and implemented a performance threshold management application that automatically and dynamically computes two separate component-level thresholds: one for controlling Type I (false positive) errors and another for controlling Type II (false negative) errors. Our solution additionally facilitates metric selection, thus minimizing management overhead. We present the theoretical foundation for this autonomic threshold management application, describe a specific algorithm and its implementation, and evaluate it using real-life scenarios and production data sets. Our study shows that, with proper parameter tuning, our online dynamic solution can compute nearly optimal performance thresholds.
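
To illustrate the idea of two separately controlled thresholds, the following is a minimal offline sketch and not the algorithm of the paper, which the abstract does not specify. It assumes a single metric, historical samples already labeled as healthy or faulty, and an alarm rule that fires when the metric exceeds the threshold. The names `error_rates`, `separate_thresholds`, `alpha`, and `beta` are hypothetical.

```python
def error_rates(threshold, healthy, faulty):
    """Empirical error rates for the assumed alarm rule `value > threshold`."""
    # False positive: a healthy sample that raises an alarm.
    fp = sum(v > threshold for v in healthy) / len(healthy)
    # False negative: a faulty sample that raises no alarm.
    fn = sum(v <= threshold for v in faulty) / len(faulty)
    return fp, fn


def separate_thresholds(healthy, faulty, alpha=0.05, beta=0.05):
    """Return (t_fp, t_fn) chosen from the observed values.

    t_fp: lowest threshold whose false positive rate is at most alpha.
    t_fn: highest threshold whose false negative rate is at most beta.
    """
    if not healthy or not faulty:
        raise ValueError("both healthy and faulty samples are required")
    # -inf is included so that a threshold with zero false negatives exists.
    candidates = sorted(set(healthy) | set(faulty) | {float("-inf")})
    t_fp = min(t for t in candidates
               if error_rates(t, healthy, faulty)[0] <= alpha)
    t_fn = max(t for t in candidates
               if error_rates(t, healthy, faulty)[1] <= beta)
    return t_fp, t_fn


# Illustrative, made-up response times (not data from the paper).
healthy = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
faulty = [25, 30, 35, 40, 45, 50, 55, 60, 65, 70]
t_fp, t_fn = separate_thresholds(healthy, faulty, alpha=0.1, beta=0.1)
```

When `t_fp` is no greater than `t_fn`, any threshold between the two meets both error targets on this data. When `t_fp` exceeds `t_fn`, the two targets cannot be met by a single threshold, which is one reason to track the two thresholds separately.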
