Scalable methods for monitoring and detecting behavioral equivalence classes in scientific codes

Emerging petascale systems will have many hundreds of thousands of processors, but traditional task-level tracing tools already fail to scale to much smaller systems because the I/O backbones of these systems cannot handle the peak load offered by their cores. Complete event traces of all processes are thus infeasible. To retain the benefits of detailed performance measurement while reducing the volume of collected data, we developed AMPL, a general-purpose toolkit that reduces data volume using stratified sampling. We adopt a scalable sampling strategy: the sample size required to measure a system varies sub-linearly with process count. By grouping, or stratifying, processes that behave similarly, we can further reduce data overhead while also providing insight into an application's behavior. In this paper, we describe the AMPL toolkit and report our experiences using it on large-scale scientific applications. We show that AMPL can reduce the overhead of tracing scientific applications by an order of magnitude or more, and that our tool scales sub-linearly, so the improvement will be more dramatic on petascale machines. Finally, we illustrate the use of AMPL to monitor applications by performance-equivalent strata, and we show that this technique allows further reductions in trace data volume and traced execution time.
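The claim that the required sample size grows sub-linearly with process count can be illustrated with a minimal sketch. It assumes the standard finite-population formula for estimating a mean by simple random sampling without replacement, n = N σ² / ((N − 1)(d/z)² + σ²); the abstract does not state which formula AMPL uses, so the function name `min_sample_size`, its parameters, and the example values below are hypothetical.

```python
import math


def min_sample_size(population, sigma, error_bound, z=1.96):
    """Processes to sample so the sample mean lies within error_bound
    of the population mean at the confidence level implied by z.

    Assumption: textbook finite-population formula for simple random
    sampling without replacement; not taken from the AMPL source.
    """
    if population <= 1:
        return population
    d = (error_bound / z) ** 2
    n = population * sigma ** 2 / ((population - 1) * d + sigma ** 2)
    # Never ask for more samples than there are processes.
    return min(population, math.ceil(n))


# Illustrative values only: unit standard deviation, error bound 0.1.
for procs in (1_000, 10_000, 100_000, 1_000_000):
    n = min_sample_size(procs, sigma=1.0, error_bound=0.1)
    print(f"{procs:>9} processes -> sample {n:>4} ({n / procs:.4%})")
```

Under these assumptions the sample size approaches the constant (zσ/d)² as the process count grows, so the sampled fraction shrinks on larger machines. A stratum of similarly behaving processes has a smaller σ and therefore needs fewer samples, which is the intuition behind stratifying.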
