Automatic SIMD Vectorization of SSA-based Control Flow Graphs

Ralf Karrenberg presents Whole-Function Vectorization (WFV), an approach that allows a compiler to automatically create code that exploits data parallelism using SIMD instructions. Data-parallel applications such as particle simulations, stock option price estimation, or video decoding require the same computations to be performed on huge amounts of data. Without WFV, one processor core executes a single instance of a data-parallel function at a time. WFV transforms the function so that it executes multiple instances at once using SIMD instructions. The author describes an advanced WFV algorithm that includes a variety of analyses and code generation techniques, and he shows that this approach improves the performance of the generated code in a variety of use cases.
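The idea of executing several instances of a function at once can be illustrated with a small sketch. The code below is not taken from the book: the function `kernel_scalar`, the width `W = 4`, and the use of plain Python lists to stand in for SIMD registers are all assumptions made for illustration. It shows one common way a branch might be handled when instances are packed together, namely by evaluating both paths for every lane and selecting the result per lane with a mask.

```python
W = 4  # assumed SIMD width: number of instances packed into one call


def kernel_scalar(x):
    """One instance of a hypothetical data-parallel function."""
    if x > 0.0:
        return x * 2.0
    return -x


def kernel_wfv(xs):
    """Hand-written counterpart that processes W instances, one per lane.

    Lists model SIMD registers here; a real compiler would emit
    vector instructions instead.
    """
    assert len(xs) == W
    mask = [x > 0.0 for x in xs]      # branch condition becomes a per-lane mask
    then_val = [x * 2.0 for x in xs]  # "then" path computed for all lanes
    else_val = [-x for x in xs]       # "else" path computed for all lanes
    # Per-lane select (blend): keep the value of the path each lane would take.
    return [t if m else e for m, t, e in zip(mask, then_val, else_val)]


def run_all(data):
    """Apply the packed kernel to data whose length is a multiple of W."""
    out = []
    for i in range(0, len(data), W):
        out.extend(kernel_wfv(data[i:i + W]))
    return out
```

Calling `run_all` on a list whose length is a multiple of `W` should give the same values as applying `kernel_scalar` to each element in turn, while needing only a quarter as many kernel calls. Handling inputs whose length is not a multiple of `W` is left out of this sketch.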
