Energy-Efficient Run-Time Mapping and Thread Partitioning of Concurrent OpenCL Applications on CPU-GPU MPSoCs

Heterogeneous Multi-Processor Systems-on-Chip (MPSoCs) containing CPU and GPU cores are typically required to execute applications concurrently. However, as will be shown in this paper, existing approaches are not well suited to concurrent applications: they either consider only a single application or do not exploit CPU and GPU cores at the same time. In this paper, we propose an energy-efficient run-time mapping and thread partitioning approach for executing concurrent OpenCL applications on both CPU and GPU cores while satisfying performance requirements. For each concurrently executing application, and depending on its performance requirements, the mapping process finds the appropriate number of CPU cores and the operating frequencies of the CPU and GPU cores, and the partitioning process identifies an efficient partitioning of the application's threads between CPU and GPU cores. We validate the proposed approach experimentally on the Odroid-XU3 hardware platform with various mixes of applications from the Polybench benchmark suite. Additionally, a case study is performed with a real-world application, SLAMBench. Results show an average energy saving of 32% compared to existing approaches while still satisfying the performance requirements.
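To make the two steps named in the abstract concrete, the following is a minimal sketch of what a joint mapping and partitioning search could look like. It is not the paper's algorithm: the exhaustive search, the function names, the `toy_model` time and power formulas, and all numeric constants are assumptions chosen purely for illustration. The sketch enumerates candidate operating points (number of CPU cores, CPU frequency, GPU frequency, fraction of work items given to the CPU) and keeps the lowest-energy point that still meets a deadline.

```python
from itertools import product


def partition_work_items(total_items, cpu_fraction):
    """Split a 1-D range of work items into a CPU share and a GPU share."""
    cpu_items = round(total_items * cpu_fraction)
    return cpu_items, total_items - cpu_items


def toy_model(cpu_items, gpu_items, n_cores, cpu_mhz, gpu_mhz):
    """Hypothetical time/energy model (assumption, not measured data).

    Throughput scales linearly with frequency and core count;
    power scales with the cube of frequency.
    """
    cpu_time = cpu_items / (n_cores * cpu_mhz * 10.0)  # seconds
    gpu_time = gpu_items / (gpu_mhz * 50.0)            # seconds
    cpu_power = n_cores * (cpu_mhz / 1000.0) ** 3      # watts
    gpu_power = 2.0 * (gpu_mhz / 600.0) ** 3           # watts
    # CPU and GPU run in parallel, so the slower side sets the finish time.
    exec_time = max(cpu_time, gpu_time)
    energy = cpu_power * cpu_time + gpu_power * gpu_time
    return exec_time, energy


def pick_operating_point(total_items, deadline_s, core_counts,
                         cpu_freqs_mhz, gpu_freqs_mhz, fractions,
                         model=toy_model):
    """Return the lowest-energy point meeting the deadline, or None."""
    best = None
    for n, fc, fg, frac in product(core_counts, cpu_freqs_mhz,
                                   gpu_freqs_mhz, fractions):
        cpu_items, gpu_items = partition_work_items(total_items, frac)
        exec_time, energy = model(cpu_items, gpu_items, n, fc, fg)
        if exec_time > deadline_s:
            continue  # violates the performance requirement
        if best is None or energy < best["energy"]:
            best = {"cores": n, "cpu_mhz": fc, "gpu_mhz": fg,
                    "cpu_fraction": frac, "time": exec_time,
                    "energy": energy}
    return best


if __name__ == "__main__":
    point = pick_operating_point(
        total_items=1_000_000, deadline_s=20.0,
        core_counts=[1, 2, 3, 4],
        cpu_freqs_mhz=[600, 1000, 1400],
        gpu_freqs_mhz=[177, 350, 600],
        fractions=[i / 10 for i in range(11)],
    )
    print(point)
```

A real run-time manager would replace `toy_model` with profiled or predicted time and power figures for the target platform, and would likely prune the search rather than enumerate every combination.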
