Understanding memory access patterns using the BSC performance tools

The growing gap between processor and memory speeds has led to complex memory hierarchies, as processors evolve to mitigate this divergence by exploiting locality of reference. In this direction, the BSC performance analysis tools have recently been extended to provide insight into application memory accesses, depicting their temporal and spatial characteristics and correlating them simultaneously with the source code and the achieved performance. These extensions rely on the Precise Event-Based Sampling (PEBS) mechanism available in recent Intel processors to capture information about the application's memory accesses. The sampled information is later combined with the Folding technique to represent the detailed temporal evolution of the memory accesses in conjunction with the achieved performance and the corresponding source code. The results obtained from the combination of these tools help not only application developers but also processor architects to better understand how the application behaves and how the system performs. In this paper, we describe a tighter integration of the sampling mechanism into the monitoring package. We also demonstrate the value of the complete workflow by exploring already optimized state-of-the-art benchmarks and providing detailed insight into their memory access behavior. We have taken advantage of this insight to apply small modifications that improve the applications' performance.
