Adaptive Runtime Filtering: Reducing Trace Size and Bias in Event-Based Performance Analysis

In this paper, we address the problem of massive event trace sizes, one of the most urgent challenges in the performance analysis of large-scale parallel applications. Reducing trace size during application runtime decreases application slowdown, eliminates measurement bias, and reduces stress on the underlying file system. Previous approaches use static filters to decrease trace size; these either rely on prior knowledge about the application or deliver poor results. In contrast, we propose runtime filters that automatically adapt to an application's runtime behavior and therefore do not require any prior knowledge. We present and compare four adaptive runtime filters: for regions that are leaf nodes in the call tree, for regions with similar duration, for activities within iterations, and for blocks of activities with repetitive behavior. We evaluate a prototype implementation of these filters, based on the state-of-the-art trace collector Score-P and the Open Trace Format 2 trace library, with five real-life applications. The filters achieve a trace size reduction of up to two orders of magnitude with an additional overhead of less than one percent on average.
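To make the first of the four filter criteria concrete, the following is a minimal sketch of dropping regions that are leaf nodes in the call tree. It is an illustration under stated assumptions, not the prototype described above: the event tuple layout, the function name `filter_leaf_regions`, and the single-pass strategy are all hypothetical, and the sketch does not use the Score-P or Open Trace Format 2 APIs. It also shows only the structural criterion; it does not model how a filter would adapt to runtime behavior.

```python
from typing import Iterable, List, Tuple

# Hypothetical event layout: (kind, region name, timestamp),
# where kind is "enter" or "leave".
Event = Tuple[str, str, int]


def filter_leaf_regions(events: Iterable[Event]) -> List[Event]:
    """Drop enter/leave pairs of regions that contain no nested region.

    Assumes a well-nested event stream from a single thread.
    """
    kept: List[Event] = []
    # One flag per open region: has a nested region been entered yet?
    has_child: List[bool] = []

    for kind, region, timestamp in events:
        if kind == "enter":
            if has_child:
                has_child[-1] = True  # the enclosing region is not a leaf
            has_child.append(False)
            kept.append((kind, region, timestamp))
        elif kind == "leave":
            if has_child.pop():
                kept.append((kind, region, timestamp))
            else:
                # Leaf region: nothing was recorded since its enter event,
                # so that enter event is the last kept entry. Discard it
                # and skip the leave event as well.
                kept.pop()
    return kept
```

For example, given the stream `main { foo {} bar { baz {} } }`, the sketch removes `foo` and `baz` and keeps `main` and `bar`, because a region counts as a leaf only if it had no children in the original call tree.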
