Exploiting vector parallelism in software pipelined loops

An emerging trend in processor design is the addition of short vector instructions to general-purpose and embedded ISAs. Frequently, these extensions are employed using traditional vectorization technology first developed for supercomputers. In contrast, scalar hardware is typically targeted using ILP techniques such as software pipelining. This paper presents a novel approach for exploiting vector parallelism in software pipelined loops. The proposed methodology (i) lowers the burden on the scalar resources by offloading computation to the vector functional units, (ii) explicitly manages communication of operands between scalar and vector instructions, (in) naturally handles misaligned vector memory operations, and (iv) partially (or fully) inhibits the optimization when vectorization will decrease performance. Our approach results in better resource utilization and allows for software pipelining with shorter initiation intervals. The proposed optimization is applied in the compiler backend, where vectorization decisions are more amenable to cost analysis. This is unique in that traditional vectorization optimizations are usually carried out at the statement level. Although our technique most naturally complements statically scheduled machines, we believe it is applicable to any architecture that tightly integrates support for instruction and data level parallelism. We evaluate our methodology using nine SPEC FP benchmarks. In comparison to software pipelining, our approach achieves a maximum speedup of 1.38times, with an average of 1.11times

[1]  Saman P. Amarasinghe,et al.  Exploiting superword level parallelism with multimedia instruction sets , 2000, PLDI '00.

[2]  Peng Wu,et al.  Vectorization for SIMD architectures with alignment constraints , 2004, PLDI '04.

[3]  Andreas Krall,et al.  Compilation Techniques for Multimedia Processors , 2004, International Journal of Parallel Programming.

[4]  Derek J. DeVries A vectorizing SUIF compiler, implementation and performance , 1997 .

[5]  Peng Zhao,et al.  An integrated simdization framework using virtual vectors , 2005, ICS '05.

[6]  Hunter Scales,et al.  AltiVec Extension to PowerPC Accelerates Media Processing , 2000, IEEE Micro.

[7]  Jaewook Shin,et al.  Compiler-controlled caching in superword register files for multimedia extension architectures , 2002, Proceedings.International Conference on Parallel Architectures and Compilation Techniques.

[8]  Wen-mei W. Hwu,et al.  Modulo scheduling of loops in control-intensive non-numeric programs , 1996, Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 29.

[9]  Z. Greenfield,et al.  The TigerSHARC DSP Architecture , 2000, IEEE Micro.

[10]  David A. Padua,et al.  Dependence graphs and compiler optimizations , 1981, POPL '81.

[11]  Alexandre E. Eichenberger,et al.  Effective cluster assignment for modulo scheduling , 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.

[12]  Alexandre E. Eichenberger,et al.  Stage scheduling: a technique to reduce the register requirements of a module schedule , 1995, MICRO 1995.

[13]  Peter Kogge,et al.  Generation of permutations for SIMD processors , 2005, LCTES '05.

[14]  Ruby B. Lee Subword parallelism with MAX-2 , 1996, IEEE Micro.

[15]  Cameron McNairy,et al.  Itanium 2 Processor Microarchitecture , 2003, IEEE Micro.

[16]  B. Ramakrishna Rau,et al.  Register allocation for software pipelined loops , 1992, PLDI '92.

[17]  Michael Wolfe,et al.  High performance compilers for parallel computing , 1995 .

[18]  D. Naishlos,et al.  Autovectorization in GCC , 2004 .

[19]  Yuan Zhao,et al.  Scalarization on Short Vector Machines , 2005, IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005..

[20]  R. Govindarajan,et al.  A Vectorizing Compiler for Multimedia Extensions , 2000, International Journal of Parallel Programming.

[21]  Antonio González,et al.  Graph-partitioning based instruction scheduling for clustered processors , 2001, MICRO.

[22]  Aart J. C. Bik Software Vectorization Handbook, The: Applying Intel Multimedia Extensions for Maximum Performance , 2004 .

[23]  E. Ayguade,et al.  Modulo scheduling with integrated register spilling for clustered VLIW architectures , 2001, Proceedings. 34th ACM/IEEE International Symposium on Microarchitecture. MICRO-34.

[24]  Alexandre E. Eichenberger,et al.  Stage scheduling: a technique to reduce the register requirements of a modulo schedule , 1995, Proceedings of the 28th Annual International Symposium on Microarchitecture.

[25]  Jaewook Shin,et al.  Superword-level parallelism in the presence of control flow , 2005, International Symposium on Code Generation and Optimization.

[26]  Monica S. Lam,et al.  RETROSPECTIVE : Software Pipelining : An Effective Scheduling Technique for VLIW Machines , 1998 .

[27]  Antonio González,et al.  A unified modulo scheduling and register allocation technique for clustered processors , 2001, Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques.

[28]  Alain Darte On the Complexity of Loop Fusion , 2000, Parallel Comput..

[29]  S.,et al.  An Efficient Heuristic Procedure for Partitioning Graphs , 2022 .

[30]  Ken Kennedy,et al.  Conversion of control dependence to data dependence , 1983, POPL '83.

[31]  Ayal Zaks,et al.  Vectorizing for a SIMdD DSP architecture , 2003, CASES '03.

[32]  Vladimir M. Pentkovski,et al.  Implementing Streaming SIMD Extensions on the Pentium III Processor , 2000, IEEE Micro.

[33]  Ken Kennedy,et al.  Optimizing Compilers for Modern Architectures: A Dependence-based Approach , 2001 .

[34]  Peng Wu,et al.  Efficient SIMD code generation for runtime alignment and length conversion , 2005, International Symposium on Code Generation and Optimization.

[35]  Aart Johannes Casimir Bik The software vectorization handbook , 2004 .

[36]  Dirk Grunwald,et al.  A system level perspective on branch architecture performance , 1995, Proceedings of the 28th Annual International Symposium on Microarchitecture.

[37]  James C. Dehnert,et al.  Overlapped loop support in the Cydra 5 , 1989, ASPLOS 1989.

[38]  Marc Tremblay,et al.  VIS speeds new media processing , 1996, IEEE Micro.

[39]  Robert E. Tarjan,et al.  Depth-First Search and Linear Graph Algorithms , 1972, SIAM J. Comput..

[40]  Richard A. Huff,et al.  Lifetime-sensitive modulo scheduling , 1993, PLDI '93.

[41]  Steven W. K. Tjiang,et al.  SUIF: an infrastructure for research on parallelizing and optimizing compilers , 1994, SIGP.

[42]  Andreas Krall,et al.  Pointer Alignment Analysis for Processors with SIMD Instructions , 2003 .

[43]  Emmett Witchel,et al.  Increasing and detecting memory address congruence , 2002, Proceedings.International Conference on Parallel Architectures and Compilation Techniques.