Auto-vectorization of interleaved data for SIMD

Most implementations of the Single Instruction Multiple Data (SIMD) model available today require that data elements be packed contiguously in vector registers. Operations on disjoint vector elements are not supported directly and require explicit data reorganization. Computations on non-contiguous, and especially interleaved, data appear in important applications, which can benefit greatly from SIMD instructions once the data is reorganized properly. Vectorizing such computations efficiently is therefore a significant challenge for both programmers and vectorizing compilers. We present an automatic compilation scheme that supports effective vectorization in the presence of interleaved data with constant strides that are powers of 2, facilitating data reorganization. We show how our vectorization scheme applies to dominant SIMD architectures, and present experimental results on a wide range of key kernels, showing speedups of up to a factor of 3.7 in execution time for interleaving levels (strides) as high as 8.
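To make the notion of interleaved data concrete, the following C sketch shows a stride-2 access pattern (complex numbers stored as re, im, re, im, ...) and a hand-written reorganization of the kind a vectorizer would have to generate. It is an illustration only, not the compilation scheme of the paper: the function names, the vector length `VF`, and the helpers `deinterleave2` and `interleave2` are all hypothetical, and plain arrays stand in for vector registers.

```c
#include <stddef.h>

/* Assumed vector length in elements (hypothetical; e.g. 4 floats in 128 bits). */
enum { VF = 4 };

/* Scalar reference: multiply complex arrays stored interleaved with stride 2. */
void cmul_interleaved(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float ar = a[2 * i], ai = a[2 * i + 1];
        float br = b[2 * i], bi = b[2 * i + 1];
        out[2 * i]     = ar * br - ai * bi;
        out[2 * i + 1] = ar * bi + ai * br;
    }
}

/* Split 2*VF interleaved elements into even-indexed and odd-indexed groups. */
static void deinterleave2(const float *src, float *even, float *odd)
{
    for (int k = 0; k < VF; k++) {
        even[k] = src[2 * k];
        odd[k]  = src[2 * k + 1];
    }
}

/* Inverse of deinterleave2: merge two VF-element groups into 2*VF elements. */
static void interleave2(const float *even, const float *odd, float *dst)
{
    for (int k = 0; k < VF; k++) {
        dst[2 * k]     = even[k];
        dst[2 * k + 1] = odd[k];
    }
}

/* Same computation, restructured so the arithmetic runs on packed groups. */
void cmul_reorganized(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
    for (; i + VF <= n; i += VF) {
        float ar[VF], ai[VF], br[VF], bi[VF], re[VF], im[VF];
        deinterleave2(a + 2 * i, ar, ai);
        deinterleave2(b + 2 * i, br, bi);
        /* Each statement below maps to one element-wise vector operation. */
        for (int k = 0; k < VF; k++) {
            re[k] = ar[k] * br[k] - ai[k] * bi[k];
            im[k] = ar[k] * bi[k] + ai[k] * br[k];
        }
        interleave2(re, im, out + 2 * i);
    }
    /* Scalar epilogue for the remaining n % VF elements. */
    for (; i < n; i++) {
        float ar = a[2 * i], ai = a[2 * i + 1];
        float br = b[2 * i], bi = b[2 * i + 1];
        out[2 * i]     = ar * br - ai * bi;
        out[2 * i + 1] = ar * bi + ai * br;
    }
}
```

On real SIMD hardware the two helpers would typically be replaced by permute or shuffle instructions; the sketch only shows where the reorganization sits relative to the arithmetic.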
