OpenMP Task Scheduling Analysis via OpenMP Runtime API and Tool Visualization

OpenMP tasks provide a new dimension of concurrency for capturing irregular parallelism within applications. Tasks allow programmers to express concurrency at a high level of abstraction and make the OpenMP runtime responsible for the burden of scheduling parallel execution. Observing the performance of OpenMP task scheduling strategies portably across shared-memory platforms has been a challenge due to the lack of a performance interface standard in the runtime layer. In this paper, we exploit our proposed tasking extensions to the OpenMP Runtime API (ORA), known as the Collector API, for profiling task-level parallelism. We describe the integration of these Collector API extensions, implemented in the OpenUH compiler, into the TAU performance system. Our proposed task extensions are in line with the new interface specification called OMPT, which is currently under evaluation by the OpenMP community. We use this integration to analyze various OpenMP task scheduling strategies implemented in OpenUH. These scheduling strategies are evaluated with respect to exploiting data locality, maintaining load balance, and minimizing overhead costs. We present a comprehensive performance study of diverse OpenMP benchmarks from the Barcelona OpenMP Tasks Suite, comparing different task pools (DEFAULT, SIMPLE, SIMPLE_2LEVEL, PUBLIC_PRIVATE), task queues (DEQUE, FIFO, CFIFO, LIFO, INV_DEQUE), and task queue storage types (ARRAY, DYN_ARRAY, LIST, LOCKLESS) on a 48-core AMD Opteron multicore system. Our results show that benchmarks with similar characteristics exhibit similar behavior with respect to the performance of the applied scheduling strategies. Moreover, the task pool configuration, which controls the organization of task queues, was found to have the highest impact on performance.

[3]  Alejandro Duran, et al. Barcelona OpenMP Tasks Suite: A Set of Benchmarks Targeting the Exploitation of Task Parallelism in OpenMP, 2009, International Conference on Parallel Processing.

[15]  Alejandro Duran, et al. An adaptive cut-off for task parallelism, 2008, SC - International Conference for High Performance Computing, Networking, Storage and Analysis.