Graph Support and Scheduling for OpenCL on Heterogeneous Multi-core Systems

Computation on heterogeneous multi-core systems offers many opportunities for optimization, including compute resource scheduling, such as distributing the workload between the CPU and GPU, and finding the combination of tasks and compute devices that gives the best performance. Currently, OpenCL, the parallel programming standard for heterogeneous computing, consists mainly of low-level APIs for interacting with each vendor's runtime and hardware devices. To apply an efficient scheduling algorithm, the overall execution flow and information about the OpenCL kernels must be considered. In this paper, we propose computational graph support for OpenCL. The framework features computational graphs that store the metadata and execution dependencies of kernels. We then provide a scheduling framework for OpenCL programs based on this graph information, in which kernel task scheduling follows the graph model. In addition, we apply kernel code analysis to decide the target device and optimize the work-group size at runtime. Preliminary experimental results show that our scheme significantly improves performance, achieving a speedup of about 1.59 times relative to our neural network program baseline.
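
To make the idea of a kernel graph concrete, the following is a minimal sketch in Python rather than the framework's actual API, which the abstract does not describe. It assumes each node carries a kernel name, a target device, an estimated cost, and a list of dependencies; the names `KernelNode`, `topological_order`, and `schedule` are hypothetical, as are the metadata fields. It also assumes kernels are ordered with Kahn's topological sort, which is one plausible choice and not a confirmed detail of the paper.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class KernelNode:
    """One kernel in the graph (all fields are illustrative assumptions)."""
    name: str
    device: str                 # assumed target device: "cpu" or "gpu"
    cost: float                 # assumed estimated run time, arbitrary units
    deps: list = field(default_factory=list)  # names of kernels to wait for


def topological_order(nodes):
    """Return kernel names in dependency order using Kahn's algorithm."""
    indegree = {n.name: len(n.deps) for n in nodes}
    successors = {n.name: [] for n in nodes}
    for n in nodes:
        for dep in n.deps:
            successors[dep].append(n.name)

    ready = deque(name for name, deg in indegree.items() if deg == 0)
    order = []
    while ready:
        current = ready.popleft()
        order.append(current)
        for nxt in successors[current]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(nodes):
        raise ValueError("kernel graph contains a dependency cycle")
    return order


def schedule(nodes):
    """Assign each kernel a start time on its device, respecting dependencies."""
    by_name = {n.name: n for n in nodes}
    device_free = {}   # device -> time at which it becomes idle
    finish = {}        # kernel name -> finish time
    plan = []
    for name in topological_order(nodes):
        node = by_name[name]
        # A kernel starts once its device is idle and all inputs are ready.
        start = max([device_free.get(node.device, 0.0)] +
                    [finish[d] for d in node.deps])
        finish[name] = start + node.cost
        device_free[node.device] = finish[name]
        plan.append((name, node.device, start))
    return plan


if __name__ == "__main__":
    graph = [
        KernelNode("a", "gpu", 2.0),
        KernelNode("b", "cpu", 1.0),
        KernelNode("c", "gpu", 1.0, deps=["a", "b"]),
    ]
    for name, device, start in schedule(graph):
        print(f"{name} on {device} starts at t={start}")
```

In this toy graph, kernels `a` and `b` have no dependencies and can overlap on different devices, while `c` must wait for both. A real scheduler would also choose the device per kernel instead of taking it as given, which is the part the paper addresses with kernel code analysis.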
