TABARNAC: Tools for Analyzing Behavior of Applications Running on NUMA Architecture

In modern parallel architectures, memory accesses are a common bottleneck, so optimizing how applications access memory is an important way to improve performance and reduce energy consumption. Memory accesses matter even more on NUMA machines, where the access time to data depends on its location in memory. Many efforts have been made to develop adaptive tools that improve memory accesses at runtime by optimizing the mapping of data and threads to NUMA nodes. However, these tools cannot change the memory access pattern of the original application, so code written without considering memory performance might not benefit from them. Moreover, automatic mapping tools take time to converge towards the best mapping, losing optimization opportunities. A deeper understanding of the memory behavior can help optimize it, removing the need for runtime analysis. In this paper, we present TABARNAC, a tool for analyzing the memory behavior of parallel applications with a focus on NUMA architectures. TABARNAC provides a new visualization of the memory access behavior, focusing on the distribution of accesses by thread and by data structure. This visualization allows the developer to easily understand why performance issues occur and how to fix them. Using TABARNAC, we explain why some applications do not benefit from data and thread mapping. We also propose code modifications that improve the memory access behavior of several parallel applications.
