Revec: program rejuvenation through revectorization

Modern microprocessors are equipped with Single Instruction Multiple Data (SIMD) or vector instructions which expose data level parallelism at a fine granularity. Programmers exploit this parallelism by using low-level vector intrinsics in their code. However, once programs are written using vector intrinsics of a specific instruction set, the code becomes non-portable. Modern compilers are unable to analyze and retarget the code to newer vector instruction sets. Hence, programmers have to manually rewrite the same code using vector intrinsics of a newer generation to exploit higher data widths and capabilities of new instruction sets. This process is tedious, error-prone and requires maintaining multiple code bases. We propose Revec, a compiler optimization pass which revectorizes already vectorized code, by retargeting it to use vector instructions of newer generations. The transformation is transparent, happening at the compiler intermediate representation level, and enables performance portability of hand-vectorized code. Revec can achieve performance improvements in real-world performance critical kernels. In particular, Revec achieves geometric mean speedups of 1.160× and 1.430× on fast integer unpacking kernels, and speedups of 1.145× and 1.195× on hand-vectorized x265 media codec kernels when retargeting their SSE-series implementations to use AVX2 and AVX-512 vector instructions respectively. We also extensively test Revec’s impact on 216 intrinsic-rich implementations of image processing and stencil kernels relative to hand-retargeting.

[1]  Bo Huang,et al.  Optimizing dynamic binary translation for SIMD instructions , 2006, International Symposium on Code Generation and Optimization (CGO'06).

[2]  Vikram S. Adve,et al.  LLVM: a compilation framework for lifelong program analysis & transformation , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004..

[3]  Martin C. Rinard,et al.  Taint-based directed whitebox fuzzing , 2009, 2009 IEEE 31st International Conference on Software Engineering.

[4]  José E. Moreira,et al.  Simple, portable and fast SIMD intrinsic programming: generic simd library , 2014, WPMVP '14.

[5]  Saman P. Amarasinghe,et al.  Exploiting superword level parallelism with multimedia instruction sets , 2000, PLDI '00.

[6]  Peng Wu,et al.  Vectorization for SIMD architectures with alignment constraints , 2004, PLDI '04.

[7]  Wei-Chung Hsu,et al.  SIMD Code Translation in an Enhanced HQEMU , 2015, 2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS).

[8]  Gilles Pokam,et al.  SWARP: a retargetable preprocessor for multimedia instructions: Research Articles , 2004 .

[9]  Ayal Zaks,et al.  Auto-vectorization of interleaved data for SIMD , 2006, PLDI '06.

[10]  Fred Weber,et al.  AMD 3DNow! technology: architecture and implementations , 1999, IEEE Micro.

[11]  Hao Zhou,et al.  Exploiting mixed SIMD parallelism by reducing data reorganization overhead , 2016, 2016 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).

[12]  Volker Lindenstruth,et al.  Vc: A C++ library for explicit vectorization , 2012, Softw. Pract. Exp..

[13]  Björn Franke,et al.  Free Rider: A Tool for Retargeting Platform-Specific Intrinsic Functions , 2015, LCTES.

[14]  Leonid Boytsov,et al.  Decoding billions of integers per second through vectorization , 2012, Softw. Pract. Exp..

[15]  Franz Franchetti,et al.  SPIRAL: Code Generation for DSP Transforms , 2005, Proceedings of the IEEE.

[16]  David A. Padua,et al.  An Evaluation of Vectorizing Compilers , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[17]  Ayal Zaks,et al.  Outer-loop vectorization - revisited for short SIMD architectures , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[18]  Alexander Zeier,et al.  SIMD-Scan: Ultra Fast in-Memory Table Scan using on-Chip Vector Processing Units , 2009, Proc. VLDB Endow..

[19]  Gilles Pokam,et al.  SWARP: a retargetable preprocessor for multimedia instructions , 2004, Concurr. Comput. Pract. Exp..

[20]  M. Wegman,et al.  Global value numbers and redundant computations , 1988, POPL '88.

[21]  Ismail Oukid,et al.  Vectorizing Database Column Scans with Complex Predicates , 2013, ADMS@VLDB.

[22]  Franz Franchetti,et al.  Efficient Utilization of SIMD Extensions , 2005, Proceedings of the IEEE.

[23]  Wei-Chung Hsu,et al.  Improving SIMD Parallelism via Dynamic Binary Translation , 2018, ACM Trans. Embed. Comput. Syst..

[24]  Emmett Witchel,et al.  Increasing and detecting memory address congruence , 2002, Proceedings.International Conference on Parallel Architectures and Compilation Techniques.

[25]  Ken Kennedy,et al.  Automatic translation of FORTRAN programs to vector form , 1987, TOPL.

[26]  Franz Franchetti,et al.  Automatic SIMD vectorization of fast fourier transforms for the larrabee and AVX instruction sets , 2011, ICS '11.

[27]  R. Govindarajan,et al.  A Vectorizing Compiler for Multimedia Extensions , 2000, International Journal of Parallel Programming.