Memory Trace Compression and Replay for SPMD Systems using Extended PRSDs

Concurrency levels in large-scale supercomputers are rising exponentially, and shared-memory nodes with hundreds of cores and non-uniform memory access latencies are expected within the next decade. Even current petascale systems, with tens of cores per node, suffer from memory bottlenecks, and as core counts increase, memory issues become critical for the performance of large-scale supercomputers. Trace analysis tools are vital for diagnosing the root causes of memory problems. However, existing tools either are expensive to use because of prohibitively large trace sizes, or collect only statistical summaries that omit valuable information. In this paper, we present ScalaMemTrace, a novel technique for collecting memory traces in a scalable manner. ScalaMemTrace builds on prior trace methods and adds aggressive compression techniques that allow lossless representation of memory traces for dense algebraic kernels, with near-constant trace size irrespective of the problem size or the number of threads. We further introduce a replay mechanism for ScalaMemTrace traces and discuss the results of our prototype implementation on the x86-64 architecture.
