A Performance Study for Iterative Stencil Loops on GPUs with Ghost Zone Optimizations
[1] Kevin Skadron,et al. A performance study of general-purpose applications on graphics processors using CUDA , 2008, J. Parallel Distributed Comput..
[2] G. M.,et al. Partial Differential Equations I , 2023, Applied Mathematical Sciences.
[3] Uday Bondhugula,et al. Effective automatic parallelization of stencil computations , 2007, PLDI '07.
[4] Chau-Wen Tseng,et al. Tiling Optimizations for 3D Scientific Computations , 2000, ACM/IEEE SC 2000 Conference (SC'00).
[5] John Abraham,et al. Three-dimensional multi-relaxation time (MRT) lattice-Boltzmann models for multiphase flow , 2007, J. Comput. Phys..
[6] William Jalby,et al. Optimizing Matrix Operations on a Parallel Multiprocessor with a Memory Hierarchical System , 1986, ICPP.
[7] Wen-mei W. Hwu,et al. CUDA-Lite: Reducing GPU Programming Complexity , 2008, LCPC.
[8] G. Allen,et al. Supporting Efficient Execution in Heterogeneous Distributed Computing Environments with Cactus and Globus , 2001, ACM/IEEE SC 2001 Conference (SC'01).
[9] Guy L. Steele,et al. Fortran at ten gigaflops , 1991 .
[10] L. Dagum,et al. OpenMP: an industry standard API for shared-memory programming , 1998 .
[11] Sanjay V. Rajopadhye,et al. Positivity, posynomials and tile size selection , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[12] Kevin Skadron,et al. Scalable parallel programming , 2008, 2008 IEEE Hot Chips 20 Symposium (HCS).
[13] Ian T. Foster,et al. Cactus Application: Performance Predictions in Grid Environments , 2001, Euro-Par.
[14] Leonid Oliker,et al. Impact of modern memory subsystems on cache optimizations for stencil computations , 2005, MSP '05.
[15] Steven J. Deitz,et al. Eliminating redundancies in sum-of-product array computations , 2001, ICS '01.
[16] Shirley Dex,et al. JR 旅客販売総合システム(マルス)における運用及び管理について , 1991 .
[17] Samuel Williams,et al. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[18] David G. Wonnacott,et al. Achieving Scalable Locality with Time Skewing , 2002, International Journal of Parallel Programming.
[19] Kevin Skadron,et al. Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs , 2009, ICS.
[20] J. R. Gilbert,et al. Mobile and replicated alignment of arrays in data-parallel programs , 1993, Supercomputing '93. Proceedings.
[21] Kevin Skadron,et al. Temperature-aware microarchitecture: Modeling and implementation , 2004, TACO.
[22] Ulrich Rüde,et al. Cache-Aware Multigrid Methods for Solving Poisson's Equation in Two Dimensions , 2000, Computing.
[23] Tarek S. Abdelrahman,et al. Fusion of Loops for Parallelism and Locality , 1997, IEEE Trans. Parallel Distributed Syst..
[24] PeiZong Lee,et al. Techniques for Compiling Programs on Distributed Memory Multicomputers , 1995, Parallel Comput..
[25] Michael Gschwind. Chip multiprocessing and the cell broadband engine , 2006, CF '06.
[26] OpenMP: A Proposed Industry Standard API for Shared Memory Programming , 2022 .
[27] P. Sadayappan,et al. Communication-Free Hyperplane Partitioning of Nested Loops , 1993, J. Parallel Distributed Comput..
[28] Sanjay V. Rajopadhye,et al. Towards Optimal Multi-level Tiling for Stencil Computations , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.
[29] Li Chen,et al. Redundant computation partition on distributed-memory systems , 2002, Fifth International Conference on Algorithms and Architectures for Parallel Processing, 2002. Proceedings..
[30] David G. Wonnacott,et al. Time Skewing for Parallel Computers , 1999, LCPC.
[31] Zhiyuan Li,et al. Automatic tiling of iterative stencil loops , 2004, TOPL.
[32] Zhiyi Yang,et al. Parallel Image Processing Based on CUDA , 2008, 2008 International Conference on Computer Science and Software Engineering.
[33] Mark Alpert. Not Just Fun and Games , 1999 .
[34] Kevin Skadron,et al. Compact thermal modeling for temperature-aware design , 2004, Proceedings. 41st Design Automation Conference, 2004..
[35] J. Ramanujam,et al. Tiling of Iteration Spaces for Multicomputers , 1990, ICPP.
[36] Guy L. Steele,et al. Fortran at ten gigaflops: the connection machine convolution compiler , 1991, PLDI '91.
[37] Volker Strumpen,et al. Cache oblivious stencil computations , 2005, ICS '05.