A compiler framework for extracting superword level parallelism

SIMD (single-instruction, multiple-data) instruction set extensions are now common in both high-performance and embedded microprocessors, and they enable the exploitation of a specific type of data parallelism called superword level parallelism (SLP). While prior research shows that significant performance gains are possible when SLP is exploited, manually inserting SIMD instructions into application code is difficult and error-prone. In this paper, we propose a novel automated compiler framework for improving the exploitation of superword level parallelism. The key part of our framework consists of two stages: superword statement generation and data layout optimization. The first stage is our main contribution and has two phases, statement grouping and statement scheduling, whose primary goals are to increase SIMD parallelism and, more importantly, to capture more superword reuses among the superword statements through global analysis of data access and reuse patterns. As a complementary optimization, our data layout optimization organizes data in memory so that the cost of memory operations for SLP is minimized. Results from our compiler implementation, tested on two systems, indicate performance improvements as high as 15.2% over a state-of-the-art SLP optimization algorithm.
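To make the terms "statement grouping" and "superword reuse" concrete, the following is a minimal sketch in Python and not the algorithm proposed in the paper. It assumes a simple greedy pairing rule: two statements may be packed when they use the same operator, are independent of each other, and access adjacent array elements. The names `Ref`, `Stmt`, `can_pack`, `group_pairs`, and `count_reuses` are hypothetical and introduced only for illustration. The sketch also ignores whether reordering statements is legal with respect to intervening dependences, which a real compiler must check.

```python
from typing import List, NamedTuple, Tuple


class Ref(NamedTuple):
    """A scalar memory reference such as a[0] (hypothetical representation)."""
    array: str
    index: int


class Stmt(NamedTuple):
    """A statement of the form dest = src1 op src2."""
    dest: Ref
    op: str
    src1: Ref
    src2: Ref


def adjacent(x: Ref, y: Ref) -> bool:
    # Two references are adjacent if they name consecutive elements of one array.
    return x.array == y.array and y.index == x.index + 1


def can_pack(s: Stmt, t: Stmt) -> bool:
    # Assumed packing rule: same operator, no dependence between the two
    # statements, and every operand position touches adjacent memory.
    independent = (
        s.dest != t.dest
        and s.dest not in (t.src1, t.src2)
        and t.dest not in (s.src1, s.src2)
    )
    return (
        s.op == t.op
        and independent
        and adjacent(s.dest, t.dest)
        and adjacent(s.src1, t.src1)
        and adjacent(s.src2, t.src2)
    )


def group_pairs(stmts: List[Stmt]) -> List[Tuple[Stmt, Stmt]]:
    # Greedy grouping: pair each unused statement with the first later
    # statement it can be packed with. Each pair models one two-wide
    # superword statement.
    groups, used = [], set()
    for i, s in enumerate(stmts):
        if i in used:
            continue
        for j in range(i + 1, len(stmts)):
            if j not in used and can_pack(s, stmts[j]):
                groups.append((s, stmts[j]))
                used.update((i, j))
                break
    return groups


def count_reuses(groups: List[Tuple[Stmt, Stmt]]) -> int:
    # A superword reuse occurs when the pack written by one group is read,
    # in the same lane order, as a source pack by a later group.
    reuses = 0
    for k, g in enumerate(groups):
        produced = tuple(s.dest for s in g)
        for later in groups[k + 1:]:
            for field in ("src1", "src2"):
                if tuple(getattr(s, field) for s in later) == produced:
                    reuses += 1
    return reuses


stmts = [
    Stmt(Ref("a", 0), "+", Ref("b", 0), Ref("c", 0)),  # a[0] = b[0] + c[0]
    Stmt(Ref("a", 1), "+", Ref("b", 1), Ref("c", 1)),  # a[1] = b[1] + c[1]
    Stmt(Ref("d", 0), "*", Ref("a", 0), Ref("e", 0)),  # d[0] = a[0] * e[0]
    Stmt(Ref("d", 1), "*", Ref("a", 1), Ref("e", 1)),  # d[1] = a[1] * e[1]
]
groups = group_pairs(stmts)
print(len(groups), count_reuses(groups))
```

Here the two additions form one superword statement and the two multiplications form another. The pack `(a[0], a[1])` written by the first is consumed directly by the second, so it can stay in a SIMD register without repacking; this is the kind of reuse the grouping and scheduling phases aim to increase.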
