A Comparison of Task Parallel Frameworks based on Implicit Dependencies in Multi-core Environments

The greater flexibility that task parallelism offers with respect to data parallelism comes at the cost of higher complexity, due to the variety of tasks and the arbitrary patterns of dependencies they can exhibit. These dependencies must be expressed not only correctly but also optimally, i.e., avoiding over-constraining the execution, in order to obtain the maximum performance from the underlying hardware. There have been many proposals to facilitate this non-trivial task, particularly within the scope of the now-ubiquitous multi-core architectures. A family of solutions that is very interesting because of its wide scope of application, ease of use, and potential performance is that in which the user declares the dependencies of each task and lets the parallel programming framework determine the concrete dependencies that appear at runtime, scheduling the parallel tasks accordingly. Nevertheless, as far as we know, there are no comparative studies that help users identify the relative advantages of these tools. In this paper we describe and evaluate four tools of this class, discussing the strengths and weaknesses we have found in their use.

Keywords: programmability; task parallelism; dependencies; programming models
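
To make this declared-dependency model concrete, the sketch below (our illustration, not code from the paper) uses the depend clauses of OpenMP tasks, a well-known representative of this family: the programmer only annotates what each task reads (in) and writes (out), and the runtime derives the actual task graph and schedules execution accordingly.

#include <stdio.h>

int main(void) {
    int x = 0, y = 0;

    #pragma omp parallel
    #pragma omp single
    {
        /* Task A: declares an output dependency on x. */
        #pragma omp task depend(out: x)
        x = 42;

        /* Task B: reads x and writes y. The runtime detects the
           conflict on x and runs B only after A has finished. */
        #pragma omp task depend(in: x) depend(out: y)
        y = x + 1;

        /* Task C: also reads x. Two "in" dependencies do not
           conflict, so C may run concurrently with B. */
        #pragma omp task depend(in: x)
        printf("C sees x = %d\n", x);
    }   /* implicit barrier: all tasks have completed here */

    printf("y = %d\n", y);  /* prints y = 43 */
    return 0;
}

Compiled with, e.g., gcc -fopenmp, this source contains no explicit synchronization between tasks: the partial order of execution is recovered at runtime from the declared dependencies, which is precisely the property shared by the frameworks evaluated in this paper.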
