A data-centric profiler for parallel programs

It is difficult to manually identify opportunities for enhancing data locality. To address this problem, we extended the HPCToolkit performance tools to support data-centric profiling of scalable parallel programs. Our tool uses hardware counters to directly measure memory access latency and attributes latency metrics to both variables and instructions. Different hardware counters provide insight into different aspects of data locality (or lack thereof). Unlike prior tools for data-centric analysis, our tool employs scalable measurement, analysis, and presentation methods that enable it to analyze the memory access behavior of scalable parallel programs with low runtime and space overhead. We demonstrate the utility of HPCToolkit's new data-centric analysis capabilities with case studies of five well-known benchmarks. In each benchmark, we identify performance bottlenecks caused by poor data locality and demonstrate non-trivial performance optimizations enabled by this guidance.
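The attribution step described above, mapping each sampled memory access to both the variable and the instruction involved, can be illustrated with a minimal sketch. This is not HPCToolkit's implementation. It assumes samples have already been collected as (instruction pointer, effective address, latency) triples and that each variable's address range is known; the names `allocations`, `samples`, and `attribute`, and all addresses and latencies, are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical allocation records: (start_address, size_in_bytes, variable_name).
allocations = [
    (0x1000, 0x400, "grid"),
    (0x2000, 0x100, "weights"),
]

# Hypothetical address samples: (instruction_pointer, effective_address, latency_cycles).
samples = [
    (0x400A10, 0x1008, 250),
    (0x400A10, 0x13F8, 300),
    (0x400B24, 0x2040, 40),
    (0x400B24, 0x9000, 120),  # falls outside every known allocation
]


def attribute(samples, allocations):
    """Sum sampled latency per variable (data-centric) and per instruction (code-centric)."""
    ranges = sorted(allocations)  # sort by start address for binary search
    starts = [start for start, _, _ in ranges]
    by_variable = defaultdict(int)
    by_instruction = defaultdict(int)
    for ip, addr, latency in samples:
        by_instruction[ip] += latency
        # Find the last allocation whose start address is <= the sampled address.
        i = bisect_right(starts, addr) - 1
        name = "<unknown>"
        if i >= 0:
            start, size, var = ranges[i]
            if addr < start + size:
                name = var
        by_variable[name] += latency
    return dict(by_variable), dict(by_instruction)


by_variable, by_instruction = attribute(samples, allocations)
```

In this sketch the same samples yield two views: `by_variable` shows which data objects accumulate the most latency, while `by_instruction` shows which accesses incur it. Binary search over sorted, non-overlapping ranges keeps each lookup logarithmic in the number of tracked allocations.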
