Model-Driven Tile Size Selection for DOACROSS Loops on GPUs
[1] Rudolf Eigenmann, et al. OpenMP to GPGPU: a compiler framework for automatic translation and optimization, 2009, PPoPP '09.
[2] Richard W. Vuduc, et al. Model-driven autotuning of sparse matrix-vector multiply on GPUs, 2010, PPoPP '10.
[3] James Demmel, et al. Benchmarking GPUs to tune dense linear algebra, 2008, HiPC 2008.
[4] Yi Yang, et al. A GPGPU compiler for memory optimization and parallelism management, 2010, PLDI '10.
[5] Dongrui Fan, et al. Extendable pattern-oriented optimization directives, 2011, CGO 2011.
[6] Yang Yang, et al. Automatic Library Generation for BLAS3 on GPUs, 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[7] Tomofumi Yuki, et al. Automatic creation of tile size selection models, 2010, CGO '10.
[8] Xipeng Shen, et al. A cross-input adaptive framework for GPU program optimizations, 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.
[9] Andreas Moshovos, et al. Demystifying GPU microarchitecture through microbenchmarking, 2010, 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS).
[10] William Gropp, et al. An adaptive performance modeling tool for GPU architectures, 2010, PPoPP '10.
[11] J. Ramanujam, et al. Automatic C-to-CUDA Code Generation for Affine Programs, 2010, CC.
[12] Hyesoon Kim, et al. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness, 2009, ISCA '09.
[13] A. Quarteroni, et al. Numerical Approximation of Partial Differential Equations, 2008.
[14] Wu-chun Feng, et al. Inter-block GPU communication via fast barrier synchronization, 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).
[15] Hui Wu, et al. Toward Harnessing DOACROSS Parallelism for Multi-GPGPUs, 2010, 2010 39th International Conference on Parallel Processing.
[16] Jingling Xue, et al. Code tiling for improving the cache performance of PDE solvers, 2003, 2003 International Conference on Parallel Processing, 2003. Proceedings.
[17] J. Craggs. Applied Mathematical Sciences, 1973.
[18] Dongrui Fan, et al. Extendable pattern-oriented optimization directives, 2012, International Symposium on Code Generation and Optimization (CGO 2011).
[19] W. Hackbusch. Iterative Solution of Large Sparse Systems of Equations, 1993.
[20] Th. Lippert, et al. A parallel SSOR preconditioner for lattice QCD, 1996, hep-lat/9608066.
[21] Jingling Xue, et al. Loop Tiling for Parallelism, 2000, Kluwer International Series in Engineering and Computer Science.
[22] Wen-mei W. Hwu, et al. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA, 2008, PPoPP.