Function Call Re-Vectorization

Programming languages such as C for CUDA, OpenCL, and ISPC have helped increase the programmability of SIMD accelerators and graphics processing units. However, these languages still lack the flexibility offered by low-level SIMD programming on explicit vectors. To close this expressiveness gap while preserving performance, this paper introduces the notion of Function Call Re-Vectorization (CREV). CREV allows the dimension of vectorization to be changed during the execution of a kernel, exposing this change as a nested parallel kernel call. CREV affords programmability close to that of dynamic parallelism, a feature that allows kernels to be invoked from inside other kernels, but at a much lower cost. In this paper, we present a formal semantics of CREV and an implementation of it in the ISPC compiler. We have used CREV to implement several classic algorithms, including string matching, depth-first search, and Bellman-Ford, with minimal effort. These algorithms, once compiled by ISPC to Intel-based vector instructions, are as fast as state-of-the-art implementations, yet much simpler. Thus, CREV gives developers the elegance of dynamic parallelism and the performance of explicit SIMD programming.
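
To make the idea concrete, the sketch below shows how a CREV-style nested call might look in ISPC for the string-matching use case mentioned above. This is an illustration under assumptions, not the paper's actual code: the `crev` keyword and its exact semantics are inferred from the abstract's description of nested parallel kernel calls, and the names match_at and find_all are hypothetical. Everything else is standard ISPC.

    // Callee: compares the pattern at one candidate position using the
    // full SIMD width. Under the assumed CREV semantics, each active
    // caller lane's value of `i` arrives here as the uniform `pos` of
    // one nested invocation.
    void match_at(uniform int8 text[], uniform int8 pat[],
                  uniform int pos, uniform int m, uniform bool found[]) {
        bool ok = true;                  // varying: per-lane mismatch flag
        foreach (j = 0 ... m) {
            if (text[pos + j] != pat[j])
                ok = false;
        }
        found[pos] = all(ok);            // true iff no lane saw a mismatch
    }

    // Caller: scans candidate positions, one per SIMD lane (assumes n >= m).
    export void find_all(uniform int8 text[], uniform int n,
                         uniform int8 pat[], uniform int m,
                         uniform bool found[]) {
        foreach (i = 0 ... n - m + 1) {
            found[i] = false;
            if (text[i] == pat[0]) {
                // Divergent branch: only some lanes reach this call.
                // A CREV call (assumed syntax) re-vectorizes match_at
                // once per active lane, so every nested invocation runs
                // with all SIMD lanes cooperating.
                crev match_at(text, pat, i, m, found);
            }
        }
    }

With a plain call at the divergent point, match_at would execute once per active lane with only that lane doing useful work; dynamic parallelism would achieve the same nesting by launching a fresh kernel, at far higher cost. CREV keeps the nested invocation on the same SIMD unit, which is where its performance advantage comes from.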
