Exploiting mixed SIMD parallelism by reducing data reorganization overhead

Existing loop vectorization techniques can exploit only one of intra- or inter-iteration SIMD parallelism in a code region if one part of the region, vectorized for one type of parallelism, has data dependences (called mixed-parallelism-inhibiting dependences) on another part of the region vectorized for the other type. In this paper, we consider a class of loops that exhibit both types of parallelism (i.e., mixed SIMD parallelism) in code regions that contain mixed-parallelism-inhibiting data dependences. We present a new compiler approach for exploiting such mixed SIMD parallelism effectively by reducing the data reorganization overhead incurred when one type of parallelism is switched to the other. Our auto-vectorizer is simple and has been implemented in LLVM (3.5.0). We evaluate it on seven benchmarks with mixed SIMD parallelism selected from the SPEC and NAS benchmark suites and demonstrate its performance advantages over the state of the art.
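To make the notion of mixed SIMD parallelism concrete, the following C sketch shows a loop of the kind the abstract describes. It is an illustration only and is not taken from the paper: the function names `mixed_scalar` and `mixed_blocked`, the flat array layout, and the 4-lane width are all assumptions. Part 1 of the loop body has four isomorphic operations inside one iteration (intra-iteration parallelism), Part 2 repeats the same operation across consecutive iterations (inter-iteration parallelism), and Part 2 depends on Part 1. The second function models, with plain arrays standing in for vector registers, the data reorganization needed when switching from one packing to the other.

```c
#include <stddef.h>

/* Illustrative loop, not from the paper. a and b hold n rows of 4 floats. */
void mixed_scalar(size_t n, const float *a, const float *b, float *out) {
    for (size_t i = 0; i < n; i++) {
        /* Part 1: four isomorphic multiplies on adjacent elements of one
           iteration, a candidate for intra-iteration packing. */
        float p0 = a[4 * i + 0] * b[4 * i + 0];
        float p1 = a[4 * i + 1] * b[4 * i + 1];
        float p2 = a[4 * i + 2] * b[4 * i + 2];
        float p3 = a[4 * i + 3] * b[4 * i + 3];
        /* Part 2: one scalar chain per iteration, a candidate for
           inter-iteration packing across i..i+3. It depends on Part 1. */
        out[i] = (p0 + p1 + p2 + p3) * 0.5f + 1.0f;
    }
}

/* Same computation, restructured to expose the switch between packings.
   Each 4-element row models one hypothetical 4-lane vector register. */
void mixed_blocked(size_t n, const float *a, const float *b, float *out) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        float p[4][4]; /* p[k]: lanes are the 4 products of iteration i+k */
        float q[4][4]; /* q[j]: lanes are product j of iterations i..i+3 */
        /* Intra-iteration stage: one "vector multiply" per iteration. */
        for (size_t k = 0; k < 4; k++)
            for (size_t j = 0; j < 4; j++)
                p[k][j] = a[4 * (i + k) + j] * b[4 * (i + k) + j];
        /* Data reorganization: a 4x4 transpose moves values from the
           intra-iteration layout to the inter-iteration layout. */
        for (size_t k = 0; k < 4; k++)
            for (size_t j = 0; j < 4; j++)
                q[j][k] = p[k][j];
        /* Inter-iteration stage: lane k now belongs to iteration i+k. */
        for (size_t k = 0; k < 4; k++)
            out[i + k] = (q[0][k] + q[1][k] + q[2][k] + q[3][k]) * 0.5f + 1.0f;
    }
    /* Remainder iterations fall back to the scalar form. */
    for (; i < n; i++) {
        float s = a[4 * i] * b[4 * i] + a[4 * i + 1] * b[4 * i + 1];
        s = s + a[4 * i + 2] * b[4 * i + 2];
        s = s + a[4 * i + 3] * b[4 * i + 3];
        out[i] = s * 0.5f + 1.0f;
    }
}
```

In this sketch the transpose is the overhead incurred at the switch point. The abstract states only that the proposed approach reduces such overhead; how it does so is not shown here.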
