Toward an Analytical Performance Model to Select between GPU and CPU Execution

Automating device selection in heterogeneous computing platforms requires modelling performance on both CPUs and accelerators. This work argues that a hybrid analytical performance-modelling approach is a practical way to build fast and efficient methods to select an appropriate target for a given computation kernel. The target-selection problem has been addressed in the literature, but with a strong emphasis on building empirical models with machine-learning techniques; we argue that the applicability of such solutions is often limited in production systems. This paper focuses on building a selector that decides whether an OpenMP loop nest should be executed on a CPU or on a GPU. To this end, it offers a comprehensive comparative evaluation of GPU kernel performance across devices from multiple architecture generations. The goal is to underscore the need for accurate analytical performance models and to provide insights into the evolution of GPU accelerators. This work also highlights a weakness of existing approaches to modelling GPU performance: the accurate modelling of memory-coalescing characteristics. To address it, we examine a novel application of an inter-thread difference analysis that can further improve analytical models. Finally, this work presents an initial study of an OpenMP runtime framework that selects the target for OpenMP target offloading.
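To make the target-selection setting concrete, the sketch below shows one way a runtime decision could gate OpenMP target offloading through the standard if(target: ...) clause (OpenMP 4.5 and later). The predict_gpu_faster function is a hypothetical placeholder standing in for an analytical performance model; it is an assumption for illustration and not the selector proposed in this work.

    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical stand-in for an analytical performance model:
       returns nonzero when the model predicts the GPU will be faster.
       The threshold below is an arbitrary placeholder heuristic. */
    static int predict_gpu_faster(long trip_count) {
        return trip_count > (1L << 20) && omp_get_num_devices() > 0;
    }

    int main(void) {
        long n = 1L << 22;
        double *a = malloc(n * sizeof *a);
        double *b = malloc(n * sizeof *b);
        double *c = malloc(n * sizeof *c);
        for (long i = 0; i < n; ++i) { a[i] = (double)i; b[i] = 2.0 * i; }

        int use_gpu = predict_gpu_faster(n);

        /* When the if(target: ...) expression is false, the region
           executes on the host instead of being offloaded. */
        #pragma omp target teams distribute parallel for \
            if(target: use_gpu) \
            map(to: a[0:n], b[0:n]) map(from: c[0:n])
        for (long i = 0; i < n; ++i)
            c[i] = a[i] + b[i];

        printf("c[1] = %f, selector chose %s\n",
               c[1], use_gpu ? "device" : "host");
        free(a); free(b); free(c);
        return 0;
    }

In this setup the loop nest itself is unchanged; only the value of use_gpu, computed before the construct, determines where the region runs, which is the decision the analytical model is meant to automate.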
