Simplifying Many-Core-Based Heterogeneous SoC Programming With Offload Directives

Multiprocessor systems-on-chip (MPSoCs) are evolving into heterogeneous architectures that combine one host processor with many-core accelerators. While heterogeneous SoCs promise higher performance per watt, programming them currently requires major code rewrites using low-level programming abstractions (e.g., OpenCL). We present a programming model based on OpenMP, with additional directives to program the accelerator from a single host program. As a test case, we evaluate an implementation of this programming model for the STMicroelectronics STHORM development board. We obtain near-ideal throughput for most benchmarks, performance very close to hand-optimized OpenCL codes at a significantly lower programming complexity, and up to 30× speedup versus host execution time.
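To make the idea of directive-based offload from a single host program concrete, the following is a minimal sketch only. It uses the standard OpenMP 4.0 `target` and `map` constructs as a stand-in, since the exact directive syntax of the proposed model is not given in this abstract. The function name `scale_add` and the kernel itself are hypothetical and chosen purely for illustration.

```c
/* Hypothetical kernel, for illustration only: c[i] = alpha * a[i] + b[i].
 * Standard OpenMP 4.0 directives are used as a stand-in for the
 * offload directives described in the abstract (an assumption). */
void scale_add(const float *a, const float *b, float *c, int n, float alpha)
{
    /* map(to: ...) copies the inputs to the accelerator before the region;
     * map(from: ...) copies the result back to the host afterwards. */
    #pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        c[i] = alpha * a[i] + b[i];
    }
}
```

The loop body is ordinary host C; only the two pragmas mark the region for offload and parallel execution. Compiled without OpenMP support, or on a system with no offload device, the same code simply runs on the host, which is the property that lets one source file serve both host and accelerator.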
