NestedMP: Enabling cache-aware thread mapping for nested parallel shared memory applications

Abstract Exploiting multiple levels of parallelism benefits a wide range of applications, because a typical server now has tens of processor cores. As core counts continue to grow rapidly, efficient support for nested parallelism will become increasingly important. We observe that different task-core mapping schemes can result in significant performance differences, because modern HPC servers are NUMA multi-core systems. It is therefore important to control the task-core mapping for nested parallelism. However, the thread-count management mechanisms in current parallel programming models, such as OpenMP, do not give runtime systems enough information to make optimized decisions. As a result, current nested parallel applications often suffer from suboptimal task-core mapping and lose significant performance. To address this problem, we propose NestedMP, a set of directives that extends OpenMP. NestedMP specifies the number of threads of each nested parallel branch declaratively and lets the runtime system see the whole task tree, so that it can make locality-aware task-core mapping decisions. We have implemented NestedMP in GCC 4.8.2 and evaluated it on a 4-way 8-core Sandy Bridge server. The results show that NestedMP significantly improves performance over GCC’s OpenMP implementation.
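To make the mapping issue concrete, the following is a minimal sketch, not NestedMP's actual algorithm or directive syntax. It assumes the 4-socket, 8-cores-per-socket topology of the test machine and a two-level task tree described only by the thread count of each outer branch. The function names (`compact_mapping`, `scattered_mapping`, `sockets_spanned`) and the core numbering are hypothetical, chosen for illustration. The sketch compares a mapping that knows the whole tree with one that deals threads out without regard to which branch they belong to.

```python
# Assumed topology: 4 sockets x 8 cores, cores numbered socket by socket.
SOCKETS = 4
CORES_PER_SOCKET = 8
TOTAL_CORES = SOCKETS * CORES_PER_SOCKET


def socket_of(core):
    """Return the socket (NUMA node) that a core id belongs to."""
    return core // CORES_PER_SOCKET


def compact_mapping(branch_threads):
    """Tree-aware sketch: give each branch a contiguous run of core ids."""
    assert sum(branch_threads) <= TOTAL_CORES
    mapping, next_core = [], 0
    for n in branch_threads:
        mapping.append(list(range(next_core, next_core + n)))
        next_core += n
    return mapping


def scattered_mapping(branch_threads):
    """Tree-unaware sketch: deal threads round-robin across sockets."""
    assert sum(branch_threads) <= TOTAL_CORES
    mapping, k = [], 0
    for n in branch_threads:
        cores = []
        for _ in range(n):
            # The k-th thread overall goes to socket k % SOCKETS,
            # in slot k // SOCKETS of that socket.
            cores.append((k % SOCKETS) * CORES_PER_SOCKET + k // SOCKETS)
            k += 1
        mapping.append(cores)
    return mapping


def sockets_spanned(mapping):
    """For each branch, count how many distinct sockets its threads touch."""
    return [len({socket_of(c) for c in cores}) for cores in mapping]


if __name__ == "__main__":
    branches = [8, 8, 8, 8]  # four outer branches, eight inner threads each
    print(sockets_spanned(compact_mapping(branches)))    # [1, 1, 1, 1]
    print(sockets_spanned(scattered_mapping(branches)))  # [4, 4, 4, 4]
```

In the compact case every branch stays within one socket, so its inner threads can share that socket's cache and local memory. In the scattered case every branch spans all four sockets. A runtime can only choose the first layout if it knows the per-branch thread counts up front, which is the information the abstract says the declarative directives provide.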
