Exploiting vector parallelism in software pipelined loops
[1] Saman P. Amarasinghe,et al. Exploiting superword level parallelism with multimedia instruction sets , 2000, PLDI '00.
[2] Peng Wu,et al. Vectorization for SIMD architectures with alignment constraints , 2004, PLDI '04.
[3] Andreas Krall,et al. Compilation Techniques for Multimedia Processors , 2004, International Journal of Parallel Programming.
[4] Derek J. DeVries. A vectorizing SUIF compiler, implementation and performance , 1997 .
[5] Peng Zhao,et al. An integrated simdization framework using virtual vectors , 2005, ICS '05.
[6] Hunter Scales,et al. AltiVec Extension to PowerPC Accelerates Media Processing , 2000, IEEE Micro.
[7] Jaewook Shin,et al. Compiler-controlled caching in superword register files for multimedia extension architectures , 2002, Proceedings. International Conference on Parallel Architectures and Compilation Techniques.
[8] Wen-mei W. Hwu,et al. Modulo scheduling of loops in control-intensive non-numeric programs , 1996, Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 29.
[9] Z. Greenfield,et al. The TigerSHARC DSP Architecture , 2000, IEEE Micro.
[10] David A. Padua,et al. Dependence graphs and compiler optimizations , 1981, POPL '81.
[11] Alexandre E. Eichenberger,et al. Effective cluster assignment for modulo scheduling , 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.
[12] Alexandre E. Eichenberger,et al. Stage scheduling: a technique to reduce the register requirements of a modulo schedule , 1995, MICRO 1995.
[13] Peter Kogge,et al. Generation of permutations for SIMD processors , 2005, LCTES '05.
[14] Ruby B. Lee. Subword parallelism with MAX-2 , 1996, IEEE Micro.
[15] Cameron McNairy,et al. Itanium 2 Processor Microarchitecture , 2003, IEEE Micro.
[16] B. Ramakrishna Rau,et al. Register allocation for software pipelined loops , 1992, PLDI '92.
[17] Michael Wolfe,et al. High performance compilers for parallel computing , 1995 .
[18] D. Naishlos,et al. Autovectorization in GCC , 2004 .
[19] Yuan Zhao,et al. Scalarization on Short Vector Machines , 2005, IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.
[20] R. Govindarajan,et al. A Vectorizing Compiler for Multimedia Extensions , 2000, International Journal of Parallel Programming.
[21] Antonio González,et al. Graph-partitioning based instruction scheduling for clustered processors , 2001, MICRO.
[22] Aart J. C. Bik. The Software Vectorization Handbook: Applying Intel Multimedia Extensions for Maximum Performance , 2004 .
[23] E. Ayguade,et al. Modulo scheduling with integrated register spilling for clustered VLIW architectures , 2001, Proceedings. 34th ACM/IEEE International Symposium on Microarchitecture. MICRO-34.
[24] Alexandre E. Eichenberger,et al. Stage scheduling: a technique to reduce the register requirements of a modulo schedule , 1995, Proceedings of the 28th Annual International Symposium on Microarchitecture.
[25] Jaewook Shin,et al. Superword-level parallelism in the presence of control flow , 2005, International Symposium on Code Generation and Optimization.
[26] Monica S. Lam,et al. Retrospective: Software Pipelining: An Effective Scheduling Technique for VLIW Machines , 1998 .
[27] Antonio González,et al. A unified modulo scheduling and register allocation technique for clustered processors , 2001, Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques.
[28] Alain Darte. On the Complexity of Loop Fusion , 2000, Parallel Comput..
[29] S. Lin,et al. An Efficient Heuristic Procedure for Partitioning Graphs , 1970 .
[30] Ken Kennedy,et al. Conversion of control dependence to data dependence , 1983, POPL '83.
[31] Ayal Zaks,et al. Vectorizing for a SIMdD DSP architecture , 2003, CASES '03.
[32] Vladimir M. Pentkovski,et al. Implementing Streaming SIMD Extensions on the Pentium III Processor , 2000, IEEE Micro.
[33] Ken Kennedy,et al. Optimizing Compilers for Modern Architectures: A Dependence-based Approach , 2001 .
[34] Peng Wu,et al. Efficient SIMD code generation for runtime alignment and length conversion , 2005, International Symposium on Code Generation and Optimization.
[35] Aart Johannes Casimir Bik. The software vectorization handbook , 2004 .
[36] Dirk Grunwald,et al. A system level perspective on branch architecture performance , 1995, Proceedings of the 28th Annual International Symposium on Microarchitecture.
[37] James C. Dehnert,et al. Overlapped loop support in the Cydra 5 , 1989, ASPLOS 1989.
[38] Marc Tremblay,et al. VIS speeds new media processing , 1996, IEEE Micro.
[39] Robert E. Tarjan,et al. Depth-First Search and Linear Graph Algorithms , 1972, SIAM J. Comput..
[40] Richard A. Huff,et al. Lifetime-sensitive modulo scheduling , 1993, PLDI '93.
[41] Steven W. K. Tjiang,et al. SUIF: an infrastructure for research on parallelizing and optimizing compilers , 1994, SIGP.
[42] Andreas Krall,et al. Pointer Alignment Analysis for Processors with SIMD Instructions , 2003 .
[43] Emmett Witchel,et al. Increasing and detecting memory address congruence , 2002, Proceedings. International Conference on Parallel Architectures and Compilation Techniques.