Filtering System Metrics for Minimal Correlation-Based Self-Monitoring

Self-adaptive and self-organizing systems must be self-monitoring. Recent research has shown that self-monitoring can be enabled by exploiting correlations between monitoring variables (metrics). However, computer systems often expose a very large number of metrics for collection. Collecting them all not only degrades system performance but also incurs communication, storage, and processing overheads. To control this overhead, collection must be limited to a subset of the available metrics. Manual selection requires a good understanding of system internals, which is difficult to obtain given the size and complexity of modern computer systems. In this paper, assuming no knowledge of metric semantics or importance and no advance availability of fault data, we investigate automated methods for selecting a subset of the available metrics in the context of correlation-based monitoring. Our goal is to collect fewer metrics while maintaining the ability to detect errors. We propose several metric selection methods that require no information besides correlations, and we compare them on the basis of fault coverage. We show that our minimum spanning tree-based selection performs best, detecting on average 66% of the faults detectable by full monitoring (i.e., using all considered metrics) with only 30% of the metrics.
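
To make the idea of correlation-only selection concrete, the following is a minimal sketch and not the procedure evaluated in the paper, whose details the abstract does not give. It assumes Pearson correlation as the pairwise measure, an edge weight of 1 - |rho|, Kruskal's algorithm for the spanning tree, and a hypothetical ranking rule that keeps the metrics with the highest degree in the tree. The function names, the sample data, and the `fraction` parameter are illustrative assumptions.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    # A constant series has no defined correlation; treat it as uncorrelated.
    return cov / (sx * sy) if sx and sy else 0.0


def minimum_spanning_tree(names, samples):
    """Kruskal's algorithm over the complete graph of metrics.

    Assumption: edge weight is 1 - |rho|, so strongly correlated
    pairs are cheap and end up in the tree.
    """
    edges = sorted(
        (1.0 - abs(pearson(samples[a], samples[b])), a, b)
        for a, b in combinations(names, 2)
    )
    parent = {n: n for n in names}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree = []
    for weight, a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # the edge joins two separate components
            parent[root_a] = root_b
            tree.append((a, b, weight))
    return tree


def select_metrics(names, samples, fraction=0.3):
    """Hypothetical rule: keep the metrics with the highest tree degree."""
    degree = {n: 0 for n in names}
    for a, b, _ in minimum_spanning_tree(names, samples):
        degree[a] += 1
        degree[b] += 1
    budget = max(1, round(fraction * len(names)))
    ranked = sorted(names, key=lambda n: (-degree[n], n))
    return ranked[:budget]


# Illustrative data only; these are not measurements from the paper.
samples = {
    "cpu": [1, 2, 3, 4, 5],
    "load": [2, 4, 6, 8, 10.5],
    "reqs": [5, 3, 4, 1, 2],
    "disk": [1, 0, 1, 0, 1],
}
names = sorted(samples)
tree = minimum_spanning_tree(names, samples)
selected = select_metrics(names, samples, fraction=0.5)
```

The sketch needs only the metric samples themselves, in keeping with the stated constraint of no semantic knowledge and no fault data. Whether high-degree tree nodes are the right metrics to retain is a design assumption that would have to be checked against fault coverage.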
