Structuring the execution of OpenMP applications for multicore architectures

Multicore chips, now commonplace, introduce by design a deep hierarchy of memory and cache banks within parallel computers, as a tradeoff between the user-friendliness of shared memory on the one hand and memory access scalability and efficiency on the other. Getting high performance out of such machines, however, requires a dynamic mapping of application tasks and data onto the underlying architecture. Moreover, depending on the application's behavior, this mapping should favor cache affinity, memory bandwidth, computation synchrony, or a combination of these. The challenge is then to perform this hardware-dependent mapping in a portable, abstract way. To meet this need, we propose a new, hierarchical approach to the execution of OpenMP threads on multicore machines. Our ForestGOMP runtime system dynamically generates structured trees out of OpenMP programs and collects relationship information about threads and data. The scheduler uses this information, together with scheduling hints and hardware counter feedback, to select the most appropriate thread and data distribution. ForestGOMP also provides a high-level platform for developing and tuning portable thread schedulers. We present several applications for which we developed specific scheduling policies that achieve excellent speedups on 16-core machines.
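To make the idea of mapping a tree of threads onto a hierarchical machine more concrete, the following is a minimal illustrative sketch. It is not ForestGOMP's scheduler or API: the names `Group`, `Node`, `place` and `leaves`, the round-robin policy, and the 2x2 example topology are all assumptions introduced here for illustration. It only shows how nested teams of threads could be kept together on the same part of the machine by recursing over both trees at once.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """Hypothetical model of a team of threads; no children means one thread."""
    name: str
    children: list = field(default_factory=list)


@dataclass
class Node:
    """Hypothetical model of one hardware level; no children means a core."""
    name: str
    children: list = field(default_factory=list)


def leaves(group):
    """Return the names of all threads contained in a group."""
    if not group.children:
        return [group.name]
    return [name for child in group.children for name in leaves(child)]


def place(group, node, mapping):
    """Spread the children of a group over the children of a hardware node.

    Assumed policy: plain round-robin at each level, so that threads of the
    same team end up under the same hardware node.
    """
    if not group.children or not node.children:
        # Single thread, or no deeper hardware level: bind everything here.
        for thread in leaves(group):
            mapping[thread] = node.name
        return
    for i, child in enumerate(group.children):
        place(child, node.children[i % len(node.children)], mapping)


# Assumed example machine: 2 NUMA nodes with 2 cores each.
machine = Node("machine", [
    Node("numa0", [Node("core0"), Node("core1")]),
    Node("numa1", [Node("core2"), Node("core3")]),
])

# Assumed example program: an outer team of 2, each opening an inner team of 2.
program = Group("outer", [
    Group("team0", [Group("t0.0"), Group("t0.1")]),
    Group("team1", [Group("t1.0"), Group("t1.1")]),
])

mapping = {}
place(program, machine, mapping)
for thread, core in mapping.items():
    print(thread, "->", core)
```

In this sketch each inner team lands on a single NUMA node, which is the kind of affinity-preserving placement the abstract describes. A real runtime would additionally take scheduling hints, data location and hardware counter feedback into account.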
