Paper information - An evaluation of current SIMD programming models for C++

An evaluation of current SIMD programming models for C++

SIMD extensions were added to microprocessors in the mid-1990s to speed up data-parallel code through vectorization. Unfortunately, the SIMD programming model has barely evolved, and the most efficient utilization is still obtained with elaborate intrinsics coding. As a consequence, several approaches to writing efficient and portable SIMD code have been proposed. In this work, we evaluate current programming models for the C++ language that claim to simplify SIMD programming while maintaining high performance. The proposals were assessed by implementing two kernels: one standard floating-point benchmark and one real-world integer-based application, both highly data-parallel. Results show that the proposed solutions perform well for the floating-point kernel, achieving close to the maximum possible speed-up. For the real-world application, the programming models exhibit significant performance gaps due to data type issues, missing template support, and other problems discussed in this paper.
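To illustrate what "elaborate intrinsics coding" means in practice, the following is a minimal sketch that contrasts a scalar loop with a hand-written SSE version of the same floating-point operation. The SAXPY kernel and the function names `saxpy_scalar` and `saxpy_simd` are assumptions chosen for illustration; they are not the kernels or the code evaluated in the paper. The intrinsics path is only compiled on targets that define `__SSE__`, and otherwise the function falls back to the scalar loop.

```cpp
#include <cstddef>
#include <vector>
#if defined(__SSE__)
#include <xmmintrin.h>  // SSE intrinsics; only available on x86 targets with SSE enabled
#endif

// Scalar reference: y[i] = a * x[i] + y[i]
void saxpy_scalar(float a, const float* x, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// Hand-written intrinsics version (illustrative only).
// Processes four floats per iteration where SSE is available,
// then finishes any remaining elements with a scalar tail loop.
void saxpy_simd(float a, const float* x, float* y, std::size_t n) {
    std::size_t i = 0;
#if defined(__SSE__)
    const __m128 va = _mm_set1_ps(a);          // broadcast a to all four lanes
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);       // unaligned load of x[i..i+3]
        __m128 vy = _mm_loadu_ps(y + i);       // unaligned load of y[i..i+3]
        _mm_storeu_ps(y + i, _mm_add_ps(_mm_mul_ps(va, vx), vy));
    }
#endif
    for (; i < n; ++i) {                       // scalar tail (or full loop without SSE)
        y[i] = a * x[i] + y[i];
    }
}
```

Even for this one-line kernel, the intrinsics version has to handle the vector width, unaligned loads, and the loop remainder explicitly, and it is tied to one instruction set. That overhead is the kind of burden the portable programming models discussed in the abstract aim to remove.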

Ben H. H. Juurlink | Mauricio Alvarez | Chi Ching Chi | Biagio Cosenza | Angela Pohl

[1] Vivek Sarkar, et al. Efficient Selection of Vector Instructions Using Dynamic Programming, 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[2] Michael D. McCool, et al. Intel's Array Building Blocks: A retargetable, dynamic compiler and embedded language, 2011, International Symposium on Code Generation and Optimization (CGO 2011).

[3] Noah Treuhaft, et al. Scalable Processors in the Billion-Transistor Era: IRAM, 1997, Computer.

[4] José E. Moreira, et al. Simple, portable and fast SIMD intrinsic programming: generic simd library, 2014, WPMVP '14.

[5] Saman P. Amarasinghe, et al. Exploiting superword level parallelism with multimedia instruction sets, 2000, PLDI '00.

[6] Peng Wu, et al. Vectorization for SIMD architectures with alignment constraints, 2004, PLDI '04.

[7] Brigitte Rozoy, et al. Boost.SIMD: generic programming for portable SIMDization, 2012, PACT '12.

[8] Magnus Jahre, et al. Optimized hardware for suboptimal software: The case for SIMD-aware benchmarks, 2014, 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[9] David A. Padua, et al. An Evaluation of Vectorizing Compilers, 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[10] M. Pharr, et al. ispc: A SPMD compiler for high-performance CPU programming, 2012, 2012 Innovative Parallel Computing (InPar).

[11] Ayal Zaks, et al. Outer-loop vectorization - revisited for short SIMD architectures, 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[12] Gang Ren, et al. Optimizing data permutations for SIMD devices, 2006, PLDI '06.

[13] Volker Lindenstruth, et al. Vc: A C++ library for explicit vectorization, 2012, Softw. Pract. Exp.

[14] Timothy M. Jones, et al. Throttling Automatic Vectorization: When Less is More, 2015, 2015 International Conference on Parallel Architecture and Compilation (PACT).

[15] Ben H. H. Juurlink, et al. SIMD Acceleration for HEVC Decoding, 2015, IEEE Transactions on Circuits and Systems for Video Technology.

[16] Timothée Ewart, et al. Cyme: A Library Maximizing SIMD Computation on User-Defined Containers, 2014, ISC.

[17] Mahmut T. Kandemir, et al. A compiler framework for extracting superword level parallelism, 2012, PLDI '12.

[18] Alan Jay Smith, et al. Multimedia Instruction Sets for General Purpose Microprocessors: a, 2000.

[19] Ingo Wald, et al. Extending a C-like language for portable SIMD programming, 2012, PPoPP '12.

[20] Jaewook Shin, et al. Superword-level parallelism in the presence of control flow, 2005, International Symposium on Code Generation and Optimization.

[21] Peng Wu, et al. Efficient SIMD code generation for runtime alignment and length conversion, 2005, International Symposium on Code Generation and Optimization.