Gleipnir: A memory tracing and profiling tool

Embedded and high-performance applications often require fine-tuning to improve their performance. This is achieved using analysis tools that provide insight into the application’s behavior. A common approach is to instrument the application code and observe its behavior during an actual execution on a target system. In this paper we describe a program profiling and tracing tool called Gleipnir. Gleipnir is built as a plug-in tool for Valgrind, a widely used binary instrumentation framework. Gleipnir traces memory accesses and associates each access with a specific program-internal structure, such as a thread, a function, a local, global, or dynamic data structure, or a scalar variable. This ability makes Gleipnir a good candidate for advanced memory performance tuning. The data provided by Gleipnir may be used by trace-driven simulators, such as cache simulators, to analyze accesses to data structure elements so that programmers can understand how memory access patterns affect the execution time of the application. The programmer may then be able to change data layouts or reorder code to change the access patterns and eliminate performance bottlenecks. The goal of Gleipnir is to provide information-rich traces that can be used by any number of advanced memory analysis tools, particularly cache simulators. Our claim is that, despite advances in allocation techniques and data reordering, detailed dynamic and static memory behavior of applications is often not readily available, or is available only in terms of statistical average accesses and cache miss rates. It is our hypothesis that optimizing cache performance at all levels is very important to improving the performance of applications running on single-core and multi-core processors. In this paper we describe Gleipnir and provide examples of how the output of Gleipnir can be used by cache
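To make the idea of a trace-driven cache simulator concrete, the following is a minimal sketch of how a trace annotated with function and variable names might be replayed through a small direct-mapped cache, with hits and misses attributed to each variable. The record layout `(op, address, function, variable)`, the cache parameters, and the addresses are all hypothetical and chosen only for illustration; this is not Gleipnir's actual output format, nor the simulator used by the authors.

```python
from collections import defaultdict

# Assumed cache geometry, kept deliberately tiny so conflicts are easy to see.
LINE_SIZE = 64   # bytes per cache line (assumption)
NUM_SETS = 4     # number of sets in the direct-mapped cache (assumption)


def simulate(trace):
    """Replay (op, address, function, variable) records through a
    direct-mapped cache and count hits/misses per (function, variable)."""
    tags = [None] * NUM_SETS  # one resident tag per set
    stats = defaultdict(lambda: {"hits": 0, "misses": 0})
    for op, addr, func, var in trace:
        line = addr // LINE_SIZE   # cache line number of this address
        index = line % NUM_SETS    # set the line maps to
        tag = line // NUM_SETS     # identifies the line within its set
        key = (func, var)
        if tags[index] == tag:
            stats[key]["hits"] += 1
        else:
            stats[key]["misses"] += 1
            tags[index] = tag      # fill the line, evicting the old one
    return dict(stats)


# Hypothetical annotated trace: "L" = load, "S" = store.
trace = [
    ("L", 0x1000, "main", "a"),
    ("L", 0x1008, "main", "a"),  # same line as the previous access
    ("S", 0x1100, "main", "b"),  # maps to the same set as "a"
    ("L", 0x1000, "main", "a"),  # misses again: "b" evicted this line
]

result = simulate(trace)
for (func, var), counts in result.items():
    print(func, var, counts)
```

In this toy trace, the variables `a` and `b` map to the same cache set, so the store to `b` evicts the line holding `a`. Per-variable miss counts of this kind are what could point a programmer toward a change in data layout.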
