A Performance Study for Iterative Stencil Loops on GPUs with Ghost Zone Optimizations

Kevin Skadron | Jiayuan Meng

[1] Kevin Skadron,et al. A performance study of general-purpose applications on graphics processors using CUDA , 2008, J. Parallel Distributed Comput..

[2] G. M.,et al. Partial Differential Equations I , 2023, Applied Mathematical Sciences.

[3] Uday Bondhugula,et al. Effective automatic parallelization of stencil computations , 2007, PLDI '07.

[4] Chau-Wen Tseng,et al. Tiling Optimizations for 3D Scientific Computations , 2000, ACM/IEEE SC 2000 Conference (SC'00).

[5] John Abraham,et al. Three-dimensional multi-relaxation time (MRT) lattice-Boltzmann models for multiphase flow , 2007, J. Comput. Phys..

[6] William Jalby,et al. Optimizing Matrix Operations on a Parallel Multiprocessor with a Memory Hierarchical System , 1986, ICPP.

[7] Wen-mei W. Hwu,et al. CUDA-Lite: Reducing GPU Programming Complexity , 2008, LCPC.

[8] G. Allen,et al. Supporting Efficient Execution in Heterogeneous Distributed Computing Environments with Cactus and Globus , 2001, ACM/IEEE SC 2001 Conference (SC'01).

[9] Guy L. Steele,et al. Fortran at ten gigaflops , 1991 .

[10] L. Dagum,et al. OpenMP: an industry standard API for shared-memory programming , 1998 .

[11] Sanjay V. Rajopadhye,et al. Positivity, posynomials and tile size selection , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[12] Kevin Skadron,et al. Scalable parallel programming , 2008, 2008 IEEE Hot Chips 20 Symposium (HCS).

[13] Ian T. Foster,et al. Cactus Application: Performance Predictions in Grid Environments , 2001, Euro-Par.

[14] Leonid Oliker,et al. Impact of modern memory subsystems on cache optimizations for stencil computations , 2005, MSP '05.

[15] Steven J. Deitz,et al. Eliminating redundancies in sum-of-product array computations , 2001, ICS '01.

[16] Shirley Dex,et al. On the operation and management of the JR integrated passenger sales system (MARS) , 1991 .

[17] Samuel Williams,et al. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[18] David G. Wonnacott,et al. Achieving Scalable Locality with Time Skewing , 2002, International Journal of Parallel Programming.

[19] Kevin Skadron,et al. Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs , 2009, ICS.

[20] J. R. Gilbert,et al. Mobile and replicated alignment of arrays in data-parallel programs , 1993, Supercomputing '93. Proceedings.

[21] Kevin Skadron,et al. Temperature-aware microarchitecture: Modeling and implementation , 2004, TACO.

[22] Ulrich Rüde,et al. Cache-Aware Multigrid Methods for Solving Poisson's Equation in Two Dimensions , 2000, Computing.

[23] Tarek S. Abdelrahman,et al. Fusion of Loops for Parallelism and Locality , 1997, IEEE Trans. Parallel Distributed Syst..

[24] PeiZong Lee,et al. Techniques for Compiling Programs on Distributed Memory Multicomputers , 1995, Parallel Comput..

[25] Michael Gschwind. Chip multiprocessing and the cell broadband engine , 2006, CF '06.

[26] OpenMP: A Proposed Industry Standard API for Shared Memory Programming , 2022 .

[27] P. Sadayappan,et al. Communication-Free Hyperplane Partitioning of Nested Loops , 1993, J. Parallel Distributed Comput..

[28] Sanjay V. Rajopadhye,et al. Towards Optimal Multi-level Tiling for Stencil Computations , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.

[29] Li Chen,et al. Redundant computation partition on distributed-memory systems , 2002, Fifth International Conference on Algorithms and Architectures for Parallel Processing, 2002. Proceedings..

[30] David G. Wonnacott,et al. Time Skewing for Parallel Computers , 1999, LCPC.

[31] Zhiyuan Li,et al. Automatic tiling of iterative stencil loops , 2004, TOPL.

[32] Zhiyi Yang,et al. Parallel Image Processing Based on CUDA , 2008, 2008 International Conference on Computer Science and Software Engineering.

[33] Mark Alpert. Not Just Fun and Games , 1999 .

[34] Kevin Skadron,et al. Compact thermal modeling for temperature-aware design , 2004, Proceedings. 41st Design Automation Conference, 2004..

[35] J. Ramanujam,et al. Tiling of Iteration Spaces for Multicomputers , 1990, ICPP.

[36] Guy L. Steele,et al. Fortran at ten gigaflops: the connection machine convolution compiler , 1991, PLDI '91.

[37] Volker Strumpen,et al. Cache oblivious stencil computations , 2005, ICS '05.