NumaPerf: Predictive and Full NUMA Profiling

It is extremely challenging for parallel applications to achieve optimal performance on NUMA architectures, which makes profiling tools a necessity. However, existing NUMA profiling tools share similar shortcomings in portability, effectiveness, and helpfulness. This paper proposes a novel profiling tool, NumaPerf, that overcomes these issues. NumaPerf aims to identify potential performance issues for any NUMA architecture, rather than only for the hardware on which the program is profiled. To achieve this, NumaPerf focuses on memory sharing patterns between threads instead of the remote accesses actually observed. NumaPerf further detects potential thread migrations and load imbalance issues, which can significantly affect performance but are omitted by existing profilers. NumaPerf also reports cache coherence issues separately, since they may require different fix strategies. Based on our extensive evaluation, NumaPerf identifies more performance issues than any existing tool, and fixing the reported issues yields speedups of up to 5.94×.
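To illustrate why sharing patterns can be hardware-independent, the following minimal sketch flags pages touched by more than one thread in an access trace. It is not NumaPerf's implementation: the trace format, the 4 KB page size, and the names `page_sharing_summary` and `potentially_remote_pages` are assumptions made only for this example.

```python
from collections import defaultdict

PAGE_SIZE = 4096  # assumed page size in bytes (illustrative)


def page_sharing_summary(accesses):
    """Group a trace of (thread_id, address, is_write) records by page.

    Returns {page_number: {thread_id: [reads, writes]}}.
    """
    pages = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for thread_id, address, is_write in accesses:
        counts = pages[address // PAGE_SIZE][thread_id]
        counts[1 if is_write else 0] += 1
    return pages


def potentially_remote_pages(accesses, min_accesses=2):
    """Return pages touched by more than one thread, sorted by page number.

    Such pages could incur remote accesses whenever the sharing threads
    run on different nodes, whatever machine was used for profiling.
    """
    flagged = []
    for page, per_thread in page_sharing_summary(accesses).items():
        total = sum(reads + writes for reads, writes in per_thread.values())
        if len(per_thread) > 1 and total >= min_accesses:
            flagged.append(page)
    return sorted(flagged)


# Hypothetical trace: (thread_id, address, is_write)
trace = [
    (0, 0x1000, True),   # thread 0 writes page 1
    (1, 0x1008, False),  # thread 1 reads page 1, so the page is shared
    (0, 0x3000, True),   # thread 0 writes page 3
    (0, 0x3010, False),  # only thread 0 touches page 3, so it is private
]
shared = potentially_remote_pages(trace)
print(shared)
```

Because the check depends only on which threads touch which pages, and not on where those threads or pages happened to be placed, the same trace would flag the same pages on any NUMA topology.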
