Adaptive Monitoring of Complex Software Systems using Management Metrics

Software systems supporting networked, transaction-oriented services are large and complex; they comprise a multitude of inter-dependent layers and components, and they implement many dynamic optimization mechanisms. In addition, these systems are subject to workloads that are hard to predict. These factors make monitoring these systems, as well as performing problem determination, challenging and costly. In this thesis we tackle these challenges with the goal of lowering the cost and improving the effectiveness of monitoring and problem determination by reducing the dependence on human operators. Specifically, this thesis presents and demonstrates the effectiveness of an efficient, automated monitoring approach which enables detection of errors and failures, and which assists in localizing faults. Software systems expose various types of monitoring data; this thesis focuses on the use of management metrics to monitor a system’s health. We devise a system modeling approach which entails modeling stable, statistical correlations among management metrics; these correlations characterize a system’s normal behaviour. This approach allows a system model to be built automatically and efficiently using the monitoring data alone. In order to control the monitoring overhead, and yet allow a system’s health to be assessed reliably, we design an adaptive monitoring approach. This adaptive capability builds on the flexible nature of our system modeling approach, which allows the set of monitored metrics to be altered at runtime. We develop methods to automatically select management metrics to collect at the minimal monitoring level, without any domain knowledge. In addition, we devise an automated fault localization approach, which leverages the ability of the monitoring system to analyze individual metrics.
Using a realistic, multi-tier software system, including different applications based on Java Enterprise Edition and industrial-strength products, we evaluate our system modeling approach. We show that stable metric correlations exist in complex software systems and that many of these correlations can be modeled using simple, efficient techniques. We investigate the effect of the collection of management metrics on system performance. We show that the monitoring overhead can be high and thus needs to be controlled. We employ fault injection experiments to evaluate the effectiveness of our adaptive monitoring and fault localization approach. We demonstrate that our approach is cost-effective, has high fault coverage and, in the majority of the cases studied, provides pertinent diagnosis information.
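
As a rough illustration of the idea of modeling stable correlations among management metrics, the following Python sketch fits a simple linear regression to every pair of metrics, keeps only the well-fitting pairs as the system model, and reports the pairs that a new observation violates. It is a minimal sketch under stated assumptions and not the thesis's implementation: the metric names, the sample values, the `min_r2` and `k` thresholds, and the rule of ranking metrics by the number of violated pairs are all hypothetical choices made for this example.

```python
from itertools import combinations
from statistics import mean, pstdev


def fit_pair(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b, r2, residual_std)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        return None  # a constant metric carries no correlation information
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    r2 = 1 - sum(r * r for r in resid) / syy
    return a, b, r2, pstdev(resid)


def build_model(training, min_r2=0.9):
    """Keep only metric pairs whose linear fit is strong on the training data.

    training maps a metric name to a list of samples of equal length.
    min_r2 is an assumed cut-off for calling a correlation stable.
    """
    model = {}
    for mx, my in combinations(sorted(training), 2):
        fit = fit_pair(training[mx], training[my])
        if fit is not None and fit[2] >= min_r2:
            model[(mx, my)] = fit
    return model


def check(model, sample, k=3.0):
    """Return the pairs whose residual exceeds k training standard deviations."""
    violated = []
    for (mx, my), (a, b, _r2, sd) in model.items():
        resid = sample[my] - (a + b * sample[mx])
        if abs(resid) > k * max(sd, 1e-9):
            violated.append((mx, my))
    return violated


def rank_suspects(violated):
    """Order metrics by how many violated pairs they take part in."""
    counts = {}
    for pair in violated:
        for metric in pair:
            counts[metric] = counts.get(metric, 0) + 1
    return sorted(counts, key=lambda m: (-counts[m], m))


# Hypothetical training window: three metrics move together, one does not.
training = {
    "requests": [10, 20, 30, 40, 50, 60],
    "cpu": [21.1, 40.9, 61.0, 81.2, 100.8, 121.0],
    "db_calls": [30, 61, 90, 119, 150, 181],
    "gc_pause": [5, 1, 4, 2, 6, 3],
}
model = build_model(training)

# One observation consistent with the model, and one where db_calls collapses.
healthy = {"requests": 35, "cpu": 71.0, "db_calls": 105, "gc_pause": 2}
faulty = {"requests": 35, "cpu": 71.0, "db_calls": 20, "gc_pause": 2}

print(sorted(model))
print(check(model, healthy))
print(rank_suspects(check(model, faulty)))
```

Because each pair is modeled independently, a pair can be added to or dropped from `model` without refitting the others, which is the property an adaptive scheme would need in order to change the set of monitored metrics at runtime. Counting violated pairs per metric is only one simple way to turn per-pair violations into a diagnosis hint.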
