Elastic and scalable tracing and accurate replay of non-deterministic events

SCALATRACE represents the state of the art in parallel application tracing for high-performance computing (HPC). This paper presents SCALATRACE II, a next-generation tracer that delivers higher trace compression, even when events are not always regular. We contribute a spectrum of novel compression and replay techniques that are fundamentally different from our past approaches. SCALATRACE II features a redesigned low-level encoding scheme for trace data in which data elements are elastic and self-explanatory. Building on this encoding, innovative intra-node and inter-node trace compression algorithms guarantee high compression rates in a loop-structure-agnostic fashion. In practice, the improved compression scheme is particularly efficient for scientific codes that behave inconsistently across time steps and nodes. We further contribute a novel approach that probabilistically replays sequences of non-deterministic events. To assess the compression efficacy of SCALATRACE II, we conduct experiments not only with computational kernels but also with a real-world application, the Parallel Ocean Program (POP). Compared to the first-generation SCALATRACE, we observe key improvements in trace compression for benchmarks with inconsistent time-step behavior and diverging task-level behavior, while retaining timing accuracy even under probabilistic replay.
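The abstract does not describe how probabilistic replay is implemented, so the following is only a minimal sketch of the general idea, assuming that a trace point records how often each alternative event occurred and that replay samples from those frequencies. The names `build_histogram`, `replay`, and the `recv_from_*` event labels are hypothetical and are not taken from SCALATRACE II.

```python
import random
from collections import Counter


def build_histogram(observed_events):
    """Count how often each event was observed at one non-deterministic trace point."""
    return Counter(observed_events)


def replay(histogram, n, seed=0):
    """Draw n events, each with probability proportional to its recorded frequency."""
    rng = random.Random(seed)  # seeded so a replay run can be repeated
    events = list(histogram.keys())
    weights = [histogram[e] for e in events]
    return rng.choices(events, weights=weights, k=n)


# Hypothetical recording: the sender of a wildcard receive differs between iterations.
recorded = ["recv_from_1", "recv_from_2", "recv_from_1", "recv_from_3", "recv_from_1"]
hist = build_histogram(recorded)
replayed = replay(hist, n=1000)
```

Storing one histogram per trace point rather than the full event sequence keeps the trace compact when the ordering varies, at the cost of reproducing the ordering only in distribution rather than exactly.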
