libKOMP, an Efficient OpenMP Runtime System for Both Fork-Join and Data Flow Paradigms

To exploit high-performance computing platforms efficiently, applications must express increasingly fine-grained parallelism. The OpenMP standard has allowed programmers to do so since version 3.0, which introduced task parallelism. Although this evolution is a necessary step towards scalability on shared-memory machines with hundreds of cores, the current OpenMP specification lacks a way to express dependencies between tasks, which forces programmers to add unnecessary synchronization that degrades overall performance. This paper introduces libKOMP, an OpenMP runtime system based on the X-Kaapi library that not only outperforms popular OpenMP implementations on current task-based OpenMP benchmarks but also provides OpenMP programmers with new ways of expressing data-flow parallelism.
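The synchronization problem described above can be illustrated with a small sketch. It is not taken from the paper: the stage functions `produce` and `consume`, the array names, and the size `N` are hypothetical. The first function uses only OpenMP 3.0 tasks, so a `taskwait` must separate the two stages and every consumer waits for all producers. The second function expresses the same computation with per-element dependencies using the `depend` clause that was standardized later in OpenMP 4.0; this stands in for the data-flow style in general and is not libKOMP's own extension syntax, which the abstract does not show.

```c
#define N 4

/* Hypothetical stage functions, for illustration only. */
static int produce(int i) { return i * 10; }
static int consume(int x) { return x + 1; }

/* OpenMP 3.0 style: tasks without dependencies.
 * The taskwait between the stages makes every consumer wait for
 * ALL producers, although consumer i only needs stage1[i]. */
int sum_with_taskwait(void)
{
    int stage1[N], stage2[N], total = 0;

    #pragma omp parallel
    #pragma omp single
    {
        for (int i = 0; i < N; i++) {
            #pragma omp task shared(stage1) firstprivate(i)
            stage1[i] = produce(i);
        }
        #pragma omp taskwait   /* coarse barrier over all producers */

        for (int i = 0; i < N; i++) {
            #pragma omp task shared(stage1, stage2) firstprivate(i)
            stage2[i] = consume(stage1[i]);
        }
        #pragma omp taskwait   /* wait for all consumers */
    }

    for (int i = 0; i < N; i++)
        total += stage2[i];
    return total;
}

/* Data-flow style (OpenMP 4.0 depend clause, assumed available):
 * consumer i becomes ready as soon as producer i has finished. */
int sum_with_depend(void)
{
    int stage1[N], stage2[N], total = 0;

    #pragma omp parallel
    #pragma omp single
    {
        for (int i = 0; i < N; i++) {
            #pragma omp task depend(out: stage1[i]) shared(stage1) firstprivate(i)
            stage1[i] = produce(i);

            #pragma omp task depend(in: stage1[i]) shared(stage1, stage2) firstprivate(i)
            stage2[i] = consume(stage1[i]);
        }
        #pragma omp taskwait   /* wait for every task created above */
    }

    for (int i = 0; i < N; i++)
        total += stage2[i];
    return total;
}
```

Both functions return 64 for `N = 4`. They can be called from any `main` and built with an OpenMP-capable compiler (for example with `-fopenmp`); without OpenMP support the pragmas are ignored and the code runs sequentially with the same result. The only difference between the two versions is how much scheduling freedom the runtime gets.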
