Towards Unifying OpenMP Under the Task-Parallel Paradigm - Implementation and Performance of the taskloop Construct

OpenMP 4.5 introduced the taskloop construct, a task-parallel counterpart to the classical thread-parallel for-loop construct. With it, programmers can choose between the two parallel paradigms when parallelizing their for loops. However, it remains unclear where and when each approach should be used to write efficient parallel applications.
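
To make the two paradigms concrete, the following minimal sketch parallelizes the same loop both ways. The function names, the array-scaling workload, and the grainsize value of 1024 are illustrative assumptions and are not taken from the paper.

```c
#include <stddef.h>

/* Thread-parallel paradigm: the worksharing construct divides the
 * iterations among the threads of the team. */
void scale_parallel_for(double *a, long n, double factor)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i)
        a[i] *= factor;
}

/* Task-parallel paradigm: one thread encounters the taskloop and the
 * iterations are packaged into tasks that any thread in the team may
 * execute. grainsize(1024) is an assumed value, chosen only to show
 * how task granularity can be requested. */
void scale_taskloop(double *a, long n, double factor)
{
    #pragma omp parallel
    #pragma omp single
    {
        /* Without the nogroup clause, taskloop behaves as if enclosed
         * in a taskgroup, so all generated tasks finish before
         * execution continues past the loop. */
        #pragma omp taskloop grainsize(1024)
        for (long i = 0; i < n; ++i)
            a[i] *= factor;
    }
}
```

Both functions compute the same result. If the code is compiled without OpenMP support, the pragmas are ignored and the loops run serially; with a compiler that supports OpenMP 4.5 or later (for example GCC with `-fopenmp`), the first uses worksharing and the second uses tasks.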
