Efficient SIMD code generation for irregular kernels

Array indirection poses several challenges for compilers that target single instruction, multiple data (SIMD) instructions. The main ones are disjoint memory references, arbitrarily misaligned memory references, and dependence cycles in loops. Because of these challenges, existing SIMD compilers exclude loops with array indirection from their candidates for SIMD vectorization. Addressing them is nevertheless unavoidable, since many important compute-intensive applications use array indirection extensively to reduce memory and computation requirements. In this work, we propose a method for generating efficient SIMD code for loops containing indirected memory references. We extract both inter- and intra-iteration parallelism, taking data reorganization overhead into consideration. We also optimally place the data reorganization code so that its overhead is amortized by the performance gain of SIMD vectorization. Experiments on four array indirection kernels, extracted from real-world scientific applications, show that the proposed method effectively generates SIMD code for irregular kernels with array indirection. Compared to existing SIMD vectorization methods, the proposed method significantly improves the performance of irregular kernels, by 91% on average.
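To make the terms concrete, the following is a minimal C sketch of a loop with an indirected memory reference and of the kind of data reorganization the abstract refers to. It is an illustration only, not the code generation method proposed in this work: the function names `kernel_scalar` and `kernel_packed`, the kernel itself, and the vector length `VL` are assumptions chosen for the example. The sketch also assumes that `y` does not alias `x` and that every entry of `idx` is a valid index into `x`, so it does not exercise the dependence-cycle case.

```c
#include <stddef.h>

/* Scalar baseline: the read x[idx[i]] is an indirected (disjoint) reference,
 * so consecutive iterations may touch unrelated memory locations. */
void kernel_scalar(size_t n, const double *a, const double *x,
                   const int *idx, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a[i] * x[idx[i]];
}

#define VL 4 /* assumed number of elements per SIMD vector (illustrative) */

/* Same computation with explicit data reorganization: VL disjoint elements
 * are first packed into a contiguous buffer, after which the arithmetic
 * uses only unit-stride accesses that a vectorizer can map to SIMD. */
void kernel_packed(size_t n, const double *a, const double *x,
                   const int *idx, double *y)
{
    size_t i = 0;
    for (; i + VL <= n; i += VL) {
        double packed[VL];

        /* Reorganization step: gather the indirected operands. */
        for (size_t k = 0; k < VL; k++)
            packed[k] = x[idx[i + k]];

        /* Compute step: contiguous loads and stores only. */
        for (size_t k = 0; k < VL; k++)
            y[i + k] += a[i + k] * packed[k];
    }

    /* Scalar epilogue for the remaining n % VL iterations. */
    for (; i < n; i++)
        y[i] += a[i] * x[idx[i]];
}
```

The packing loop is pure overhead relative to the scalar version, which is why the placement and cost of reorganization code matter: the gain from the vectorized compute step has to outweigh it.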
