Multithreaded programming on the GPU: pointers and hints for the computer algebraist

It is well known that hardware acceleration technologies (multicore processors, graphics processing units, field-programmable gate arrays) provide vast opportunities for innovation in computing. In particular, GPUs combined with low-level heterogeneous programming models, such as CUDA (the Compute Unified Device Architecture, see [5, 9]), have brought supercomputing to the level of the desktop computer. However, these low-level programming models pose notable challenges, even for expert programmers: fully exploiting the power of hardware accelerators with CUDA code often requires significant code optimization effort. This two-hour tutorial covers the key principles that computer algebraists interested in GPU programming should keep in mind. The first half introduces the basics of GPU architecture and the CUDA programming model; no prior experience with GPU programming is assumed (see [9] for a reference). In the second hour, we shall discuss recent developments in GPU architecture (e.g. dynamic parallelism [8]) and programming models (e.g. OpenMP [3] and OpenACC [1]), as well as techniques for improving code performance (e.g. the MWP-CWP model [7], the TMM model [6], and the MCM model [2]). Illustrative examples are taken from the CUMODP library [4] for dense polynomial arithmetic over finite fields.
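To give a concrete flavor of the CUDA programming model discussed in the first hour, below is a minimal sketch of coefficient-wise addition of two dense polynomials over Z/pZ, the kind of finite-field kernel that libraries such as CUMODP build upon. The kernel name, the prime, and the launch configuration are illustrative assumptions, not code from the CUMODP library.

    // A minimal CUDA sketch (not CUMODP code): coefficient-wise
    // addition of two dense polynomials over Z/pZ. The prime P and
    // all names here are illustrative assumptions.
    #include <cstdio>
    #include <cuda_runtime.h>

    static const unsigned P = 469762049u;  // a 29-bit Fourier prime: 7*2^26 + 1

    // One thread per coefficient; the global index is recovered from
    // the block and thread coordinates supplied by the CUDA runtime.
    __global__ void add_mod_p(const unsigned *a, const unsigned *b,
                              unsigned *c, size_t n) {
        size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            unsigned s = a[i] + b[i];      // both operands < P < 2^31, so no overflow
            c[i] = (s >= P) ? s - P : s;   // one conditional subtraction reduces mod P
        }
    }

    int main() {
        const size_t n = 1 << 20;          // dense polynomials of degree < 2^20
        const size_t bytes = n * sizeof(unsigned);
        unsigned *a, *b, *c;
        // Unified memory keeps the sketch short; production code usually
        // stages explicit host-to-device transfers instead.
        cudaMallocManaged(&a, bytes);
        cudaMallocManaged(&b, bytes);
        cudaMallocManaged(&c, bytes);
        for (size_t i = 0; i < n; ++i) {
            a[i] = (unsigned)(i % P);
            b[i] = P - 1 - (unsigned)(i % P);
        }
        const int threadsPerBlock = 256;   // a common default, subject to tuning
        const int blocks = (int)((n + threadsPerBlock - 1) / threadsPerBlock);
        add_mod_p<<<blocks, threadsPerBlock>>>(a, b, c, n);
        cudaDeviceSynchronize();           // wait for the asynchronous launch
        printf("c[0] = %u (expected %u)\n", c[0], P - 1);
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }

The grid of thread blocks, the per-thread index computation, and the explicit kernel launch are precisely the ingredients the first hour dwells on; choices such as the number of threads per block are among the program parameters that performance models like MWP-CWP [7], TMM [6], and MCM [2] help to analyze.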

[1] Sunita Chandrasekaran et al. Compiling a High-Level Directive-Based Programming Model for GPGPUs. LCPC, 2013.

[2] Marc Moreno Maza et al. A Many-Core Machine Model for Designing Algorithms with Minimum Parallelism Overheads. PARCO, 2014.

[3] L. Dagum et al. OpenMP: An Industry Standard API for Shared-Memory Programming. IEEE Computational Science and Engineering, 1998.

[4] Marc Moreno Maza et al. Dense Arithmetic over Finite Fields with the CUMODP Library. ICMS, 2014.

[5] Wen-mei W. Hwu et al. Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA. PPoPP, 2008.

[6] Lin Ma et al. Performance Modeling for Highly-Threaded Many-Core GPUs. ASAP, 2014.

[7] Hyesoon Kim et al. An Analytical Model for a GPU Architecture with Memory-Level and Thread-Level Parallelism Awareness. ISCA, 2009.

[8] Jin Wang et al. Dynamic Thread Block Launch: A Lightweight Execution Mechanism to Support Irregular Applications on GPUs. ISCA, 2015.

[9] Kevin Skadron et al. Scalable Parallel Programming. IEEE Hot Chips 20 Symposium (HCS), 2008.