Auto-vectorization for image processing DSLs

Parallelizing programs and distributing their workloads across multiple threads can be a challenging task. Beyond multi-threading, harnessing the vector units in CPUs is highly desirable, yet employing them to speed up programs can be quite tedious. A developer either relies solely on the auto-vectorization capabilities of the compiler or manually applies vector intrinsics, which is extremely error-prone, difficult to maintain, and not portable. Based on whole-function vectorization, a method to replace control flow with data flow, we propose auto-vectorization techniques for image processing DSLs in the context of source-to-source compilation. The approach does not require the input to be available in SSA form. We formulate constraints under which the vectorization analysis and code transformations may be greatly simplified in the context of image processing DSLs. As part of our methodology, we present the control-flow-to-data-flow transformation as a source-to-source translation. Moreover, we propose a method to efficiently analyze algorithms with mixed bit-width data types to determine the optimal SIMD width, independently of the target instruction set. The techniques are integrated into an open-source DSL framework, and the resulting vectorization capabilities are compared against a variety of state-of-the-art C/C++ compilers. A geometric mean speedup of up to 3.14 is observed for benchmarks taken from ISPC and image processing, compared to non-vectorized executions.
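To illustrate what replacing control flow with data flow means in practice, the following is a minimal C++ sketch, not the transformation implemented in the framework. It assumes a hypothetical lane count `kLanes` and uses plain arrays in place of real vector registers; the names `Vec`, `Mask`, `scalar_kernel`, and `vector_kernel` are invented for this example. The branch of the scalar kernel is turned into a per-lane mask, both paths are evaluated for all lanes, and a select (blend) picks the result.

```cpp
#include <array>
#include <cstddef>

// Hypothetical SIMD width; a real implementation would derive it from
// the data types used and the target instruction set.
constexpr std::size_t kLanes = 8;

using Vec  = std::array<float, kLanes>;  // stands in for a vector register
using Mask = std::array<bool, kLanes>;   // stands in for a lane mask

// Original scalar kernel: the result depends on control flow.
float scalar_kernel(float x) {
    if (x > 0.5f) {
        return x * 2.0f;
    } else {
        return x + 1.0f;
    }
}

// Data-flow form of the same kernel, processing kLanes elements at once.
Vec vector_kernel(const Vec& x) {
    Mask mask{};
    Vec then_val{};
    Vec else_val{};
    Vec result{};

    // 1. The branch condition becomes a per-lane mask.
    for (std::size_t i = 0; i < kLanes; ++i) {
        mask[i] = x[i] > 0.5f;
    }

    // 2. Both branch bodies are evaluated unconditionally for every lane.
    for (std::size_t i = 0; i < kLanes; ++i) {
        then_val[i] = x[i] * 2.0f;
        else_val[i] = x[i] + 1.0f;
    }

    // 3. A select (blend) chooses the per-lane result based on the mask.
    for (std::size_t i = 0; i < kLanes; ++i) {
        result[i] = mask[i] ? then_val[i] : else_val[i];
    }
    return result;
}
```

Each of the three loops has a branch-free body that maps onto a single vector compare, arithmetic, or blend instruction. Evaluating both paths is only safe when neither has side effects; stores and other side-effecting operations would need to be masked as well.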
