Statistical data reduction for efficient application performance monitoring

There is a growing need for systems that automatically monitor and analyze application performance data so that applications receive reliable, sustained performance. The steadily increasing complexity of high-performance computer systems and applications makes this difficult. We introduce a statistical data reduction method that guides the selection of system metrics that are both necessary and sufficient to describe observed application behavior, thereby reducing both instrumentation perturbation and the volume of data to be managed. To evaluate our strategy, we applied it to one CPU-bound grid application using cluster machines and to GridFTP data transfer in a wide-area testbed. A comparative study shows that our strategy produces better results than the other techniques considered. It reduces the number of system metrics to be managed by about 80% while still capturing enough information for performance predictions.
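The abstract does not spell out the reduction procedure, so the following is only an illustrative sketch of one common form of statistical metric reduction, not necessarily the authors' method. It drops any metric whose time series is almost perfectly correlated with a metric that has already been kept. The function names (`pearson`, `drop_redundant`), the metric names, and the 0.95 threshold are all assumptions made for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A constant series has no defined correlation; treat it as 0 here.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def drop_redundant(metrics, threshold=0.95):
    """Greedily keep metrics not strongly correlated with one already kept.

    `metrics` maps a metric name to its list of samples. The threshold is a
    hypothetical cut-off chosen for illustration, not a value from the paper.
    """
    kept = []
    for name, series in metrics.items():
        redundant = any(
            abs(pearson(series, metrics[k])) >= threshold for k in kept
        )
        if not redundant:
            kept.append(name)
    return kept


# Made-up samples: cpu_busy is just cpu_user shifted by a constant.
samples = {
    "cpu_user": [10, 20, 30, 40, 50],
    "cpu_busy": [12, 22, 32, 42, 52],
    "disk_reads": [5, 1, 4, 2, 3],
}
selected = drop_redundant(samples)
print(selected)
```

In this toy input, `cpu_busy` carries no information beyond `cpu_user` and is discarded, while `disk_reads` is retained. Pruning redundant metrics addresses only the "necessary" half of the goal; deciding which remaining metrics are sufficient for prediction would need a further step, such as fitting a model of application performance against the retained metrics.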
