Outer-loop vectorization: revisited for short SIMD architectures

Vectorization has been an important method of exploiting data-level parallelism to accelerate scientific workloads on vector machines such as Cray for the past three decades. In the last decade it has also proven useful for accelerating multimedia and embedded applications on short SIMD architectures such as MMX, SSE and AltiVec. Most of the focus has been on innermost loops, executing as many of their iterations concurrently as possible. Outer-loop vectorization refers to vectorizing a level of a loop nest other than the innermost, which can be beneficial if the outer loop exhibits greater data-level parallelism and locality than the innermost loop. Outer-loop vectorization has traditionally been performed by interchanging an outer loop with the innermost loop and then vectorizing it at the innermost position. A more direct unroll-and-jam approach can vectorize an outer loop without loop interchange, which can be especially suitable for short SIMD architectures. In this paper we revisit outer-loop vectorization, paying special attention to properties of modern short SIMD architectures. We show that even though current optimizing compilers for such targets do not apply outer-loop vectorization in general, it can provide significant performance improvements over innermost-loop vectorization. Our implementation of direct outer-loop vectorization, available in GCC 4.3, achieves average speedup factors of 3.13 and 2.77 across a set of benchmarks, compared to 1.53 and 1.39 achieved by innermost-loop vectorization, when running on a Cell BE SPU and a PowerPC 970 processor, respectively. Moreover, outer-loop vectorization provides new reuse opportunities that can be vital for such short SIMD architectures, including efficient handling of alignment. We present an optimization tapping such opportunities, capable of further boosting the performance obtained by outer-loop vectorization to achieve average speedup factors of 5.26 and 3.64.
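
To make the unroll-and-jam idea concrete, the following is a minimal sketch in plain C, not the GCC 4.3 implementation. It assumes a FIR-style loop nest as the example kernel and a vector length of four floats; the names `fir_scalar`, `fir_outer` and `VF` are hypothetical. The outer loop over `i` is strip-mined by `VF` and the resulting copies are jammed into the inner loop, so the innermost `l` loop stands in for a single SIMD multiply-add across `VF` consecutive outer iterations.

```c
#include <stddef.h>

#define VF 4  /* assumed vector length: four floats per SIMD register */

/* Scalar reference. x must hold at least n + taps - 1 elements.
 * Vectorizing the inner j loop would require a reduction per output. */
void fir_scalar(const float *x, const float *c, float *y,
                size_t n, size_t taps)
{
    for (size_t i = 0; i < n; i++) {
        float sum = 0.0f;
        for (size_t j = 0; j < taps; j++)
            sum += x[i + j] * c[j];
        y[i] = sum;
    }
}

/* Outer loop strip-mined by VF and jammed into the inner loop.
 * Each lane l carries outer iteration i + l; no reduction is needed. */
void fir_outer(const float *x, const float *c, float *y,
               size_t n, size_t taps)
{
    size_t i = 0;
    for (; i + VF <= n; i += VF) {
        float sum[VF] = {0.0f, 0.0f, 0.0f, 0.0f};
        for (size_t j = 0; j < taps; j++) {
            /* One conceptual SIMD operation: c[j] is broadcast to all
             * lanes, x[i+j .. i+j+VF-1] is a contiguous vector load. */
            for (size_t l = 0; l < VF; l++)
                sum[l] += x[i + j + l] * c[j];
        }
        for (size_t l = 0; l < VF; l++)
            y[i + l] = sum[l];  /* one contiguous vector store */
    }
    /* Scalar epilogue for the remaining n % VF outer iterations. */
    for (; i < n; i++) {
        float sum = 0.0f;
        for (size_t j = 0; j < taps; j++)
            sum += x[i + j] * c[j];
        y[i] = sum;
    }
}
```

In this sketch the loads of `x` start at offset `i + j`, which advances by one element per inner iteration, so most of them are not aligned to a vector boundary; consecutive loads also overlap in `VF - 1` elements. This is one plausible illustration of why alignment handling and data reuse matter for outer-loop vectorization on short SIMD targets.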
