Function Call Re-Vectorization

Programming languages such as C for CUDA, OpenCL, and ISPC have helped increase the programmability of SIMD accelerators and graphics processing units. However, these languages still lack the flexibility offered by low-level SIMD programming on explicit vectors. To close this expressiveness gap while preserving performance, this paper introduces the notion of Function Call Re-Vectorization (CREV). CREV allows the dimension of vectorization to be changed during the execution of a kernel, exposing this change as a nested parallel kernel call. CREV affords programmability close to that of dynamic parallelism, a feature that allows kernels to be invoked from inside other kernels, but at a much lower cost. In this paper, we present a formal semantics of CREV and an implementation of it in the ISPC compiler. We have used CREV to implement several classic algorithms, including string matching, depth-first search, and Bellman-Ford, with minimal effort. These algorithms, once compiled by ISPC to Intel-based vector instructions, are as fast as state-of-the-art implementations, yet much simpler. Thus, CREV gives developers the elegance of dynamic parallelism and the performance of explicit SIMD programming.
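
To make the idea concrete, the sketch below shows how a CREV-style nested call might look in ISPC for the string-matching use case mentioned above. This is an illustration under assumptions, not the paper's actual code: the `crev` keyword and its exact semantics are inferred from the abstract's description of nested parallel kernel calls, and the names match_at and find_all are hypothetical. Everything else is standard ISPC.

    // Callee: compares the pattern at one candidate position using the
    // full SIMD width. Under the assumed CREV semantics, each active
    // caller lane's value of `i` arrives here as the uniform `pos` of
    // one nested invocation.
    void match_at(uniform int8 text[], uniform int8 pat[],
                  uniform int pos, uniform int m, uniform bool found[]) {
        bool ok = true;                  // varying: per-lane mismatch flag
        foreach (j = 0 ... m) {
            if (text[pos + j] != pat[j])
                ok = false;
        }
        found[pos] = all(ok);            // true iff no lane saw a mismatch
    }

    // Caller: scans candidate positions, one per SIMD lane (assumes n >= m).
    export void find_all(uniform int8 text[], uniform int n,
                         uniform int8 pat[], uniform int m,
                         uniform bool found[]) {
        foreach (i = 0 ... n - m + 1) {
            found[i] = false;
            if (text[i] == pat[0]) {
                // Divergent branch: only some lanes reach this call.
                // A CREV call (assumed syntax) re-vectorizes match_at
                // once per active lane, so every nested invocation runs
                // with all SIMD lanes cooperating.
                crev match_at(text, pat, i, m, found);
            }
        }
    }

With a plain call at the divergent point, match_at would execute once per active lane with only that lane doing useful work; dynamic parallelism would achieve the same nesting by launching a fresh kernel, at far higher cost. CREV keeps the nested invocation on the same SIMD unit, which is where its performance advantage comes from.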
