Implicit Data Permutation for SIMD Devices

SIMD extensions are one of the most effective ways to exploit data-level parallelism in current microprocessor designs. However, because of constraints such as memory address alignment and non-contiguous memory access, data permutation operations are usually needed before SIMD computation, and this overhead limits the parallelism that can be exploited. In this paper, an implicit data permutation mechanism is proposed. With our approach, the original explicit data permutation is split into two stages: explicit pattern setting and implicit data reorganization. The first stage is performed by scalar instructions, and the second is triggered implicitly when a vector register is read. This split opens new opportunities for further optimization. To make the mechanism programmable, several new scalar instructions are added and corresponding compilation strategies are proposed. Experimental results show that, on multimedia benchmarks, an average speedup of 1.18x is achieved over current SIMD optimization techniques.
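
The two-stage idea can be illustrated with a small software model. The sketch below is only a toy simulation under stated assumptions, not the hardware design or instruction set proposed in the paper: the class `VectorRegisterFile`, its methods `set_pattern`, `write`, and `read`, the helper `vector_add`, and the four-lane register width are all hypothetical names and values chosen for illustration. It shows a permutation pattern being set once by a scalar-style operation and then applied as a side effect of every later read of that register.

```python
class VectorRegisterFile:
    """Toy model: each vector register carries a pattern that is applied on read."""

    def __init__(self, num_regs=4, lanes=4):
        self.lanes = lanes
        self.data = [[0] * lanes for _ in range(num_regs)]
        # The identity pattern means a read returns the lanes unchanged.
        self.pattern = [list(range(lanes)) for _ in range(num_regs)]

    def set_pattern(self, reg, pattern):
        """Stage 1 (explicit): stands in for a scalar pattern-setting instruction."""
        if sorted(pattern) != list(range(self.lanes)):
            raise ValueError("pattern must be a permutation of the lane indices")
        self.pattern[reg] = list(pattern)

    def write(self, reg, values):
        """Store lane values as given; no reorganization happens on write."""
        if len(values) != self.lanes:
            raise ValueError("wrong number of lanes")
        self.data[reg] = list(values)

    def read(self, reg):
        """Stage 2 (implicit): lanes are reorganized as a side effect of reading."""
        src = self.data[reg]
        return [src[i] for i in self.pattern[reg]]


def vector_add(rf, dst, a, b):
    """A SIMD-style add; it contains no explicit permutation step."""
    rf.write(dst, [x + y for x, y in zip(rf.read(a), rf.read(b))])


rf = VectorRegisterFile()
rf.write(0, [10, 20, 30, 40])
rf.write(1, [1, 2, 3, 4])
rf.set_pattern(1, [3, 2, 1, 0])  # assumed pattern: reverse the lanes of register 1
vector_add(rf, 2, 0, 1)
result = rf.read(2)
print(result)  # [14, 23, 32, 41]
```

In this model the pattern is set once and reused by every subsequent read of register 1, which is one way to picture how the cost of a permutation could be moved out of the vector instruction stream.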