SIMDization of Small Tensor Multiplication Kernels for Wide SIMD Vector Processors

Developers often rely on automatic vectorization to speed up fine-grained data-parallel code. However, for loop nests in which each loop's trip count is smaller than the processor's SIMD width, automatic vectorization performs poorly: vectorizers target a single short loop and use, at best, a fraction of the processor's SIMD capacity. Vectorizing multiple nested loops together is not straightforward because the loops typically involve memory accesses with multiple strides, which conventional methods cannot profitably vectorize. We present a solution in the context of compiling small tensor multiplication kernels. Our compiler vectorizes several inner loops together in order to exploit wide vector parallelism. To handle the resulting complicated strides, we devise a vectorizable form of loop tiling: the compiler transforms loops to improve memory locality, caches tiles of data in vector registers, and turns strided access patterns into permute instructions. We show that our compiler significantly speeds up many small tensor multiplication algorithms. It judges 13.5% of a randomly generated sample of algorithms to be profitable to vectorize; on these, it generates code that is on average 1.55x as fast as that produced by GCC's state-of-the-art vectorizer, with a maximum speedup of 10x. We discuss potential extensions to vectorize more general algorithms.
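The core idea above can be illustrated outside the compiler. The sketch below is a hedged NumPy analogue, not the paper's implementation: a tiny contraction whose individual loops (extents 2, 4, and 3, all hypothetical) are each shorter than an 8-lane vector, so the two output loops are collapsed into one 8-element axis, and the strided accesses become precomputed index permutations — the software analogue of the SIMD permute instructions the compiler emits.

```python
import numpy as np

# Small tensor contraction C[i,j] = sum_k A[i,k] * B[k,j] with tiny extents
# (illustrative choice: I=2, J=4, K=3). No single loop fills an 8-lane vector.
I, J, K = 2, 4, 3
A = np.arange(I * K, dtype=np.float32).reshape(I, K)
B = np.arange(K * J, dtype=np.float32).reshape(K, J)

# Scalar reference: three short loops, each too short to vectorize alone.
C_ref = np.zeros((I, J), dtype=np.float32)
for i in range(I):
    for j in range(J):
        for k in range(K):
            C_ref[i, j] += A[i, k] * B[k, j]

# Multi-loop vectorization: collapse the i and j loops into one axis of
# I*J = 8 "lanes". The strided accesses A[i,k] and B[k,j] are realized by
# precomputed lane->index permutations (ii, jj), analogous to permutes.
ii, jj = np.meshgrid(np.arange(I), np.arange(J), indexing="ij")
ii, jj = ii.ravel(), jj.ravel()       # lane -> (i, j) mapping
C_vec = np.zeros(I * J, dtype=np.float32)
for k in range(K):                    # only the short reduction loop remains
    C_vec += A[ii, k] * B[k, jj]      # 8 lanes of useful work per iteration
C_vec = C_vec.reshape(I, J)

assert np.allclose(C_ref, C_vec)
```

The vectorized variant performs one fused multiply-add over all eight output elements per reduction step, which is the utilization win the abstract claims for combining nested short loops; in a real SIMD backend the `ii`/`jj` gathers would be replaced by in-register tile caching and permute instructions rather than memory indexing.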
