Efficient Selection of Vector Instructions Using Dynamic Programming

Accelerating program performance via SIMD vector units is very common in modern processors, as evidenced by the use of SSE, MMX, VSE, and VSX SIMD instructions in multimedia, scientific, and embedded applications. To take full advantage of the vector capabilities, a compiler needs to generate efficient vector code automatically. However, most commercial and open-source compilers fall short of using the full potential of vector units, and only generate vector code for simple innermost loops. In this paper, we present the design and implementation of an auto-vectorization framework in the back-end of a dynamic compiler that not only generates optimized vector code but is also well integrated with the instruction scheduler and register allocator. The framework includes a novel {\em compile-time efficient dynamic programming-based} vector instruction selection algorithm for straight-line code that expands opportunities for vectorization in the following ways: (1) {\em scalar packing} explores opportunities for packing multiple scalar variables into short vectors, (2) judicious use of {\em shuffle} and {\em horizontal} vector operations, when possible, and (3) {\em algebraic reassociation} expands opportunities for vectorization by algebraic simplification. We report performance results on the impact of auto-vectorization on a set of standard numerical benchmarks using the Jikes RVM dynamic compilation environment. Our results show performance improvement of up to 57.71\% on an Intel Xeon processor, compared to non-vectorized execution, with a modest increase in compile time in the range from 0.87\% to 9.992\%. An investigation of the SIMD parallelization performed by v11.1 of the Intel Fortran Compiler (IFC) on three benchmarks shows that our system achieves speedup with vectorization in all three cases, whereas IFC does not. Finally, a comparison of our approach with an implementation of the Superword Level Parallelism (SLP) algorithm from~\cite{larsen00} shows that our approach yields a performance improvement of up to 13.78\% relative to SLP.
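
To make the {\em scalar packing} idea concrete, the following minimal sketch (in C with SSE2 intrinsics, not drawn from the paper's Jikes RVM implementation; the function name and data layout are assumptions for exposition) shows how two independent scalar double-precision additions can be packed into a single vector instruction, which is the kind of opportunity the instruction selection algorithm is designed to find in straight-line code.

\begin{verbatim}
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical example of scalar packing: the two independent scalar
 * additions out[0] = a[0] + b[0] and out[1] = a[1] + b[1] are packed
 * into one ADDPD instead of two separate scalar ADDSD instructions. */
void packed_add(const double *a, const double *b, double *out)
{
    __m128d va = _mm_loadu_pd(a);     /* pack a[0], a[1] into one register */
    __m128d vb = _mm_loadu_pd(b);     /* pack b[0], b[1] into one register */
    __m128d vr = _mm_add_pd(va, vb);  /* one vector add replaces two scalar adds */
    _mm_storeu_pd(out, vr);           /* unpack the packed result */
}
\end{verbatim}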

[1] David Ryan Koes, et al. Near-optimal instruction selection on DAGs, 2008, CGO '08.

[2] Saman P. Amarasinghe, et al. Exploiting superword level parallelism with multimedia instruction sets, 2000, PLDI '00.

[3] Laurie Hendren, et al. Soot: a Java optimization framework, 1999.

[4] Peng Wu, et al. Vectorization for SIMD architectures with alignment constraints, 2004, PLDI '04.

[5] Rodric M. Rabbah, et al. Exploiting vector parallelism in software pipelined loops, 2005, 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '05).

[6] R. Nigel Horspool, et al. Compiler optimizations for processors with SIMD instructions, 2007, Softw. Pract. Exp.

[7] David H. Bailey, et al. The NAS Parallel Benchmarks, 1991, Int. J. High Perform. Comput. Appl.

[8] Uri C. Weiser, et al. MMX technology extension to the Intel architecture, 1996, IEEE Micro.

[9] Vivek Sarkar, et al. Register-sensitive selection, duplication, and sequencing of instructions, 2001, ICS '01.

[10] Peter Kogge, et al. Generation of permutations for SIMD processors, 2005, LCTES '05.

[11] Franz Franchetti, et al. FFT compiler techniques, 2004, CC.

[12] Aart J. C. Bik. The Software Vectorization Handbook: Applying Intel Multimedia Extensions for Maximum Performance, 2004.

[13] Stefan Kral. FFT-specific compilation on IBM Blue Gene, 2006, Ausgezeichnete Informatikdissertationen.

[14] Hunter Scales, et al. AltiVec extension to PowerPC accelerates media processing, 2000, IEEE Micro.

[15] Andreas Krall, et al. Compilation techniques for multimedia processors, 2004, International Journal of Parallel Programming.

[16] Alfred V. Aho, et al. Optimal code generation for expression trees, 1975, STOC.

[17] John L. Bruno, et al. Code generation for a one-register machine, 1976, J. ACM.

[18] R. Leupers. Code selection for media processors with SIMD instructions, 2000, Proceedings of Design, Automation and Test in Europe (DATE 2000).

[19] Emmett Witchel, et al. Increasing and detecting memory address congruence, 2002, Proceedings of the International Conference on Parallel Architectures and Compilation Techniques.

[20] John A. Gunnels, et al. A high-performance SIMD floating point unit for BlueGene/L: architecture, compilation, and algorithm design, 2004, Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques (PACT 2004).

[21] Ken Kennedy, et al. Optimizing Compilers for Modern Architectures: A Dependence-based Approach, 2001.

[22] James F. Power, et al. Platform independent dynamic Java virtual machine analysis: the Java Grande Forum benchmark suite, 2001, JGI '01.

[23] Keith D. Cooper, et al. Effective partial redundancy elimination, 1994, PLDI '94.