An evaluation of current SIMD programming models for C++

SIMD extensions were added to microprocessors in the mid '90s to speed up data-parallel code through vectorization. Unfortunately, the SIMD programming model has barely evolved, and the most efficient utilization is still obtained with elaborate intrinsics coding. As a consequence, several approaches to writing efficient and portable SIMD code have been proposed. In this work, we evaluate current programming models for the C++ language that claim to simplify SIMD programming while maintaining high performance. The proposals were assessed by implementing two kernels: one standard floating-point benchmark and one real-world integer-based application, both highly data parallel. Results show that the proposed solutions perform well for the floating-point kernel, achieving close to the maximum possible speed-up. For the real-world application, the programming models exhibit significant performance gaps due to data type issues, missing template support, and other problems discussed in this paper.
