Automatic Vectorization of Interleaved Data Revisited

Automatically exploiting short vector instruction sets (SSE, AVX, NEON) is a critically important task for optimizing compilers. Vector instructions typically work best on data that is contiguous in memory, and operating on non-contiguous data requires additional work to gather and scatter the data. There are several varieties of non-contiguous access, including interleaved data access. An existing approach used by GCC generates extremely efficient code for loops with power-of-2 interleaving factors (strides). In this paper we propose a generalization of this approach that produces similar code for any compile-time constant interleaving factor. In addition, we propose several novel program transformations, which were made possible by our generalized representation of the problem. Experiments show that our approach achieves significant speedups for both power-of-2 and non-power-of-2 interleaving factors. Our vectorization approach results in mean speedups over scalar code of 1.77x on Intel SSE and 2.53x on Intel AVX2 in real-world benchmarking on a selection of BLAS Level 1 routines. On the same benchmark programs, GCC 5.0 achieves mean improvements of 1.43x on Intel SSE and 1.30x on Intel AVX2. In synthetic benchmarking on Intel SSE, our maximum improvement on data movement is over 4x for gathering operations and over 6x for scattering operations versus scalar code.
