A Performance Study for Iterative Stencil Loops on GPUs with Ghost Zone Optimizations

[1]  Kevin Skadron,et al.  A performance study of general-purpose applications on graphics processors using CUDA , 2008, J. Parallel Distributed Comput..

[2]  G. M.,et al.  Partial Differential Equations I , 2023, Applied Mathematical Sciences.

[3]  Uday Bondhugula,et al.  Effective automatic parallelization of stencil computations , 2007, PLDI '07.

[4]  Chau-Wen Tseng,et al.  Tiling Optimizations for 3D Scientific Computations , 2000, ACM/IEEE SC 2000 Conference (SC'00).

[5]  John Abraham,et al.  Three-dimensional multi-relaxation time (MRT) lattice-Boltzmann models for multiphase flow , 2007, J. Comput. Phys..

[6]  William Jalby,et al.  Optimizing Matrix Operations on a Parallel Multiprocessor with a Memory Hierarchical System , 1986, ICPP.

[7]  Wen-mei W. Hwu,et al.  CUDA-Lite: Reducing GPU Programming Complexity , 2008, LCPC.

[8]  G. Allen,et al.  Supporting Efficient Execution in Heterogeneous Distributed Computing Environments with Cactus and Globus , 2001, ACM/IEEE SC 2001 Conference (SC'01).

[9]  Guy L. Steele,et al.  Fortran at ten gigaflops , 1991 .

[10]  L. Dagum,et al.  OpenMP: an industry standard API for shared-memory programming , 1998 .

[11]  Sanjay V. Rajopadhye,et al.  Positivity, posynomials and tile size selection , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[12]  Kevin Skadron,et al.  Scalable parallel programming , 2008, 2008 IEEE Hot Chips 20 Symposium (HCS).

[13]  Ian T. Foster,et al.  Cactus Application: Performance Predictions in Grid Environments , 2001, Euro-Par.

[14]  Leonid Oliker,et al.  Impact of modern memory subsystems on cache optimizations for stencil computations , 2005, MSP '05.

[15]  Steven J. Deitz,et al.  Eliminating redundancies in sum-of-product array computations , 2001, ICS '01.

[16]  Shirley Dex,et al.  On the Operation and Management of the JR Integrated Passenger Ticket Sales System (MARS) , 1991 .

[17]  Samuel Williams,et al.  Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[18]  David G. Wonnacott,et al.  Achieving Scalable Locality with Time Skewing , 2002, International Journal of Parallel Programming.

[19]  Kevin Skadron,et al.  Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs , 2009, ICS.

[20]  J. R. Gilbert,et al.  Mobile and replicated alignment of arrays in data-parallel programs , 1993, Supercomputing '93. Proceedings.

[21]  Kevin Skadron,et al.  Temperature-aware microarchitecture: Modeling and implementation , 2004, TACO.

[22]  Ulrich Rüde,et al.  Cache-Aware Multigrid Methods for Solving Poisson's Equation in Two Dimensions , 2000, Computing.

[23]  Tarek S. Abdelrahman,et al.  Fusion of Loops for Parallelism and Locality , 1997, IEEE Trans. Parallel Distributed Syst..

[24]  PeiZong Lee,et al.  Techniques for Compiling Programs on Distributed Memory Multicomputers , 1995, Parallel Comput..

[25]  Michael Gschwind.  Chip multiprocessing and the cell broadband engine , 2006, CF '06.

[26]  OpenMP: A Proposed Industry Standard API for Shared Memory Programming , 2022 .

[27]  P. Sadayappan,et al.  Communication-Free Hyperplane Partitioning of Nested Loops , 1993, J. Parallel Distributed Comput..

[28]  Sanjay V. Rajopadhye,et al.  Towards Optimal Multi-level Tiling for Stencil Computations , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.

[29]  Li Chen,et al.  Redundant computation partition on distributed-memory systems , 2002, Fifth International Conference on Algorithms and Architectures for Parallel Processing, 2002. Proceedings..

[30]  David G. Wonnacott,et al.  Time Skewing for Parallel Computers , 1999, LCPC.

[31]  Zhiyuan Li,et al.  Automatic tiling of iterative stencil loops , 2004, TOPL.

[32]  Zhiyi Yang,et al.  Parallel Image Processing Based on CUDA , 2008, 2008 International Conference on Computer Science and Software Engineering.

[33]  Mark Alpert.  Not Just Fun and Games , 1999 .

[34]  Kevin Skadron,et al.  Compact thermal modeling for temperature-aware design , 2004, Proceedings. 41st Design Automation Conference, 2004..

[35]  J. Ramanujam,et al.  Tiling of Iteration Spaces for Multicomputers , 1990, ICPP.

[36]  Guy L. Steele,et al.  Fortran at ten gigaflops: the connection machine convolution compiler , 1991, PLDI '91.

[37]  Volker Strumpen,et al.  Cache oblivious stencil computations , 2005, ICS '05.