论文信息 - A Compiler Approach for Exploiting Partial SIMD Parallelism

A Compiler Approach for Exploiting Partial SIMD Parallelism

Existing vectorization techniques are ineffective for loops that exhibit little loop-level parallelism but some limited superword-level parallelism (SLP). We show that effectively vectorizing such loops requires partial vector operations to be executed correctly and efficiently, where the degree of partial SIMD parallelism is smaller than the SIMD datapath width. We present a simple yet effective SLP compiler technique called Paver (PArtial VEctorizeR), formulated and implemented in LLVM as a generalization of the traditional SLP algorithm, to optimize such partially vectorizable loops. The key idea is to maximize SIMD utilization by widening vector instructions used while minimizing the overheads caused by memory access, packing/unpacking, and/or masking operations, without introducing new memory errors or new numeric exceptions. For a set of 9 C/C++/Fortran applications with partial SIMD parallelism, Paver achieves significantly better kernel and whole-program speedups than LLVM on both Intel’s AVX and ARM’s NEON.

Hao Zhou | Jingling Xue

[1] Albert Cohen,et al. Polyhedral-Model Guided Loop-Nest Auto-Vectorization , 2009, 2009 18th International Conference on Parallel Architectures and Compilation Techniques.

[2] Timothy M. Jones,et al. PSLP: Padded SLP automatic vectorization , 2015, 2015 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).

[3] Jim Jeffers,et al. N -Body simulation , 2016 .

[4] Emmett Witchel,et al. Increasing and detecting memory address congruence , 2002, Proceedings.International Conference on Parallel Architectures and Compilation Techniques.

[5] Jaewook Shin,et al. Superword-level parallelism in the presence of control flow , 2005, International Symposium on Code Generation and Optimization.

[6] Michael Wolfe,et al. More iteration space tiling , 1989, Proceedings of the 1989 ACM/IEEE Conference on Supercomputing (Supercomputing '89).

[7] Jaewook Shin,et al. Compiler-controlled caching in superword register files for multimedia extension architectures , 2002, Proceedings.International Conference on Parallel Architectures and Compilation Techniques.

[8] Saman P. Amarasinghe,et al. Exploiting superword level parallelism with multimedia instruction sets , 2000, PLDI '00.

[9] Peng Wu,et al. Vectorization for SIMD architectures with alignment constraints , 2004, PLDI '04.

[10] Jaewook Shin. Introducing Control Flow into Vectorized Code , 2007, 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007).

[11] Michael Goldfarb,et al. Automatic vectorization of tree traversals , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.

[12] James R. Larus,et al. SIMD parallelization of applications that traverse irregular data structures , 2013, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).

[13] Richard Veras,et al. When polyhedral transformations meet SIMD code generation , 2013, PLDI.

[14] David A. Padua,et al. An Evaluation of Vectorizing Compilers , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[15] Ayal Zaks,et al. Auto-vectorization of interleaved data for SIMD , 2006, PLDI '06.

[16] Vivek Sarkar,et al. Efficient Selection of Vector Instructions Using Dynamic Programming , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[17] Barbara M. Chapman,et al. Supercompilers for parallel and vector computers , 1990, ACM Press frontier series.

[18] Jingling Xue,et al. Region-Based Selective Flow-Sensitive Pointer Analysis , 2014, SAS.

[19] Hao Zhou,et al. Exploiting mixed SIMD parallelism by reducing data reorganization overhead , 2016, 2016 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).

[20] Jingling Xue,et al. Loop Tiling for Parallelism , 2000, Kluwer International Series in Engineering and Computer Science.

[21] Franz Franchetti,et al. SIMD Vectorization of Straight Line FFT Code , 2003, Euro-Par.

[22] Scott A. Mahlke,et al. SIMD defragmenter: efficient ILP realization on data-parallel architectures , 2012, ASPLOS XVII.

[23] Ralf Karrenberg,et al. Automatic SIMD Vectorization of SSA-based Control Flow Graphs , 2015, Springer Fachmedien Wiesbaden.

[24] Seonggun Kim,et al. Efficient SIMD code generation for irregular kernels , 2012, PPoPP '12.

[25] Gang Ren,et al. Optimizing data permutations for SIMD devices , 2006, PLDI '06.

[26] Aart J. C. Bik,et al. Automatic Intra-Register Vectorization for the Intel® Architecture , 2002, International Journal of Parallel Programming.

[27] Peng Zhao,et al. An integrated simdization framework using virtual vectors , 2005, ICS '05.

[28] Albert Cohen,et al. Vapor SIMD: Auto-vectorize once, run everywhere , 2011, International Symposium on Code Generation and Optimization (CGO 2011).

[29] R. C. Whaley,et al. Vectorization past dependent branches through speculation , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.

[30] Jingling Xue,et al. Code tiling for improving the cache performance of PDE solvers , 2003, 2003 International Conference on Parallel Processing, 2003. Proceedings..

[31] Mahmut T. Kandemir,et al. A compiler framework for extracting superword level parallelism , 2012, PLDI '12.

[32] Mithuna Thottethodi,et al. Nonlinear array layouts for hierarchical memory systems , 1999, ICS '99.

[33] Sebastian Hack,et al. Whole-function vectorization , 2011, International Symposium on Code Generation and Optimization (CGO 2011).

[34] Franz Franchetti,et al. Automatic SIMD vectorization of fast fourier transforms for the larrabee and AVX instruction sets , 2011, ICS '11.

[35] R. Govindarajan,et al. A Vectorizing Compiler for Multimedia Extensions , 2000, International Journal of Parallel Programming.

[36] Karthikeyan Sankaralingam,et al. Breaking SIMD shackles with an exposed flexible microarchitecture and the access execute PDG , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.