An optimized task-based runtime system for resource-constrained parallel accelerators

Manycore accelerators have recently proven a promising solution for increasingly powerful and energy efficient computing systems. This raises the need for parallel programming models capable of effectively leveraging hundreds to thousands of processors. Task-based parallelism has the potential to provide such capabilities, offering flexible support to fine-grained and irregular parallelism. However, efficiently supporting this programming paradigm on resource-constrained parallel accelerators is a challenging task. In this paper, we present an optimized implementation of the OpenMP tasking model for embedded parallel accelerators, discussing the key design solution that guarantee small memory (footprint) and minimize performance overheads. We validate our design by comparing to several state-of-the-art tasking implementations, using the most representative parallelization patterns. The experimental results confirm that our solution achieves near-ideal speedups for tasks as small as 5K cycles.

[1]  Spiros N. Agathos,et al.  Design and Implementation of OpenMP Tasks in the OMPi Compiler , 2011, 2011 15th Panhellenic Conference on Informatics.

[2]  Alejandro Duran,et al.  An adaptive cut-off for task parallelism , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[3]  Barbara M. Chapman,et al.  Implementing OpenMP on a high performance embedded multicore MPSoC , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[4]  Mats Brorsson,et al.  A comparative performance study of common and popular task‐centric programming frameworks , 2015, Concurr. Comput. Pract. Exp..

[5]  Alejandro Duran,et al.  Evaluation of OpenMP Task Scheduling Strategies , 2008, IWOMP.

[6]  Alejandro Duran,et al.  The Design of OpenMP Tasks , 2009, IEEE Transactions on Parallel and Distributed Systems.

[7]  Cheng Wang,et al.  libEOMP: a portable OpenMP runtime library based on MCA APIs for embedded systems , 2013, PMAM '13.

[8]  Kazuki Sakamoto,et al.  Grand Central Dispatch , 2012 .

[9]  Luca Benini,et al.  Simplifying Many-Core-Based Heterogeneous SoC Programming With Offload Directives , 2015, IEEE Transactions on Industrial Informatics.

[10]  Karl-Filip Faxén,et al.  Wool-A work stealing library , 2008, CARN.

[11]  Luca Benini,et al.  VirtualSoC: A Full-System Simulation Environment for Massively Parallel Heterogeneous System-on-Chip , 2013, 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum.

[12]  Luca Benini,et al.  Platform 2012, a many-core computing accelerator for embedded SoCs: Performance evaluation of visual analytics applications , 2012, DAC Design Automation Conference 2012.

[13]  Bradley C. Kuszmaul,et al.  Cilk: an efficient multithreaded runtime system , 1995, PPOPP '95.

[14]  James Reinders,et al.  Intel® threading building blocks , 2008 .

[15]  Luca Benini,et al.  Enabling fine-grained OpenMP tasking on tightly-coupled shared memory clusters , 2013, 2013 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[16]  Chris D. Marlin Coroutines: A Programming Methodology, a Language Design and an Implementation , 1980, Lecture Notes in Computer Science.

[17]  Christopher J. Hughes,et al.  Carbon: architectural support for fine-grained parallelism on chip multiprocessors , 2007, ISCA '07.