Unleashing Fine-Grained Parallelism on Embedded Many-Core Accelerators with Lightweight OpenMP Tasking

In recent years, programmable many-core accelerators (PMCAs) have been introduced in embedded systems to satisfy stringent performance/Watt requirements. This has increased the need for programming models capable of effectively leveraging hundreds to thousands of processors. Task-based parallelism has the potential to provide such capabilities, offering high-level abstractions to outline abundant and irregular parallelism in embedded applications. However, efficiently supporting this programming paradigm on embedded PMCAs is challenging, due to the large time and space overheads it introduces. In this paper we describe a lightweight OpenMP tasking runtime environment (RTE) design for a state-of-the-art embedded PMCA, the Kalray MPPA 256. We provide an exhaustive characterization of the costs of our RTE, considering both synthetic workloads and real programs, and we compare it with several other tasking RTEs. Experimental results confirm that our solution achieves near-ideal parallelization speedups for tasks as small as 5K cycles, and an average speedup of 12× for real benchmarks, which is ≈60% higher than what we observe with the original Kalray OpenMP implementation.
