Adaptive optimization for OpenCL programs on embedded heterogeneous systems

Heterogeneous multi-core architectures consisting of CPUs and GPUs are commonplace in today’s embedded systems. These architectures offer potential for energy-efficient computing if the application task is mapped to the right core. Realizing such potential is challenging due to the complex and evolving nature of hardware and applications. This paper presents an automatic approach to map OpenCL kernels onto heterogeneous multi-cores for a given optimization criterion – whether it is faster runtime, lower energy consumption or a trade-off between them. This is achieved by developing a machine-learning-based approach to predict which processor to use to run the OpenCL kernel and the host program, and at what frequency the processor should operate. Instead of hand-tuning a model for each optimization metric, we use machine learning to develop a unified framework that first automatically learns the optimization heuristic for each metric off-line, then uses the learned knowledge to schedule OpenCL kernels at runtime based on code and runtime information of the program. We apply our approach to a set of representative OpenCL benchmarks and evaluate it on an ARM big.LITTLE mobile platform. Our approach achieves over 93% of the performance delivered by a perfect predictor. We obtain, on average, 1.2x, 1.6x, and 1.8x improvement respectively for runtime, energy consumption and the energy delay product when compared to a competitive heterogeneous-aware OpenCL task mapping scheme.
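The offline-learning/runtime-prediction scheme described above can be sketched as follows. The feature set, training data, and classifier here are illustrative assumptions, not the paper's actual model: the paper trains a predictor over code and runtime features, whereas this sketch stands in a 1-nearest-neighbour lookup for brevity, predicting a (device, frequency) pair for an incoming kernel.

```python
# Minimal sketch of the two-phase approach, assuming hypothetical features
# (compute-to-memory ratio, data transfer size in MB) and devices of an
# ARM big.LITTLE platform with an integrated GPU. Not the paper's model.
from math import dist  # Euclidean distance, Python 3.8+

# Offline phase: profiled training kernels, each labelled with the
# (device, frequency in MHz) that best met the chosen optimization goal.
TRAINING = [
    ((8.0, 2.0),  ("GPU",        600)),   # compute-bound, small transfers
    ((6.5, 1.5),  ("GPU",        600)),
    ((0.9, 64.0), ("big CPU",   1800)),   # memory-bound, large transfers
    ((1.2, 48.0), ("big CPU",   1400)),
    ((0.3, 0.5),  ("LITTLE CPU", 800)),   # lightweight kernels
    ((0.4, 0.8),  ("LITTLE CPU", 800)),
]

def predict_mapping(features):
    """Runtime phase: return the (device, frequency) label of the
    nearest training point to the new kernel's feature vector."""
    _, label = min(TRAINING, key=lambda t: dist(t[0], features))
    return label

device, freq = predict_mapping((7.2, 1.8))  # a compute-bound kernel
```

At runtime, the learned model replaces a hand-tuned heuristic: retraining for a different metric (runtime, energy, or energy delay product) only changes the labels in the training set, not the scheduler code.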
