A Hybrid Circular Queue Method for Iterative Stencil Computations on GPUs
[1] Majid Sarrafzadeh, et al. A memory optimization technique for software-managed scratchpad memory in GPUs, 2009, 2009 IEEE 7th Symposium on Application Specific Processors.
[2] Richard W. Vuduc, et al. Model-driven autotuning of sparse matrix-vector multiply on GPUs, 2010, PPoPP '10.
[3] Hans-Peter Seidel, et al. Cache oblivious parallelograms in iterative stencil computations, 2010, ICS '10.
[4] Kevin Skadron, et al. Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs, 2009, ICS.
[5] Xiaoming Li, et al. A control-structure splitting optimization for GPGPU, 2009, CF '09.
[6] Guohua Jin, et al. Increasing Temporal Locality with Skewing and Recursive Blocking, 2001, ACM/IEEE SC 2001 Conference (SC'01).
[7] Hui Wu, et al. Toward Harnessing DOACROSS Parallelism for Multi-GPGPUs, 2010, 2010 39th International Conference on Parallel Processing.
[8] Peter Bailey, et al. Accelerating Lattice Boltzmann Fluid Flow Simulations Using Graphics Processors, 2009, 2009 International Conference on Parallel Processing.
[9] Michael F. P. O'Boyle, et al. The effect of cache models on iterative compilation for combined tiling and unrolling, 2004, Concurr. Comput. Pract. Exp.
[10] Richard W. Vuduc, et al. Tuned and wildly asynchronous stencil kernels for hybrid CPU/GPU systems, 2009, ICS.
[11] Lam H. Nguyen, et al. Hybrid Core Acceleration of UWB SIRE Radar Signal Processing, 2011, IEEE Transactions on Parallel and Distributed Systems.
[12] Helmar Burkhart, et al. PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures, 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[13] Gang Ren, et al. A comparison of empirical and model-driven optimization, 2003, PLDI '03.
[14] Ken Kennedy, et al. Profitable loop fusion and tiling using model-driven empirical search, 2006, ICS '06.
[15] Hyesoon Kim, et al. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness, 2009, ISCA '09.
[16] Y. N. Srikant, et al. Microarchitecture Sensitive Empirical Models for Compiler Optimizations, 2007, International Symposium on Code Generation and Optimization (CGO'07).
[17] Michael F. P. O'Boyle, et al. The effect of cache models on iterative compilation for combined tiling and unrolling: Research Articles, 2004.
[18] Liang Gu, et al. An empirically tuned 2D and 3D FFT library on CUDA GPU, 2010, ICS '10.
[19] David G. Wonnacott, et al. Achieving Scalable Locality with Time Skewing, 2002, International Journal of Parallel Programming.
[20] Wen-mei W. Hwu, et al. Program optimization space pruning for a multithreaded GPU, 2008, CGO '08.
[21] Paulius Micikevicius, et al. 3D finite difference computation on GPUs using CUDA, 2009, GPGPU-2.
[22] Pradeep Dubey, et al. 3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs, 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[23] John D. McCalpin, et al. Time Skewing: A Value-Based Approach to Optimizing for Memory Locality, 1999.
[24] Gang Ren, et al. Analytic Models and Empirical Search: A Hybrid Approach to Code Optimization, 2005, LCPC.
[25] William Gropp, et al. An adaptive performance modeling tool for GPU architectures, 2010, PPoPP '10.
[26] Chun Chen, et al. Combining models and guided empirical search to optimize for multiple levels of the memory hierarchy, 2005, International Symposium on Code Generation and Optimization.
[27] Samuel Williams, et al. Scientific Computing Kernels on the Cell Processor, 2007, International Journal of Parallel Programming.
[28] L. Almagor, et al. Finding effective compilation sequences, 2004, LCTES '04.
[29] Samuel Williams, et al. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures, 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[30] Philippe Bekaert, et al. Optimal Data Distribution for Versatile Finite Impulse Response Filtering on Next-Generation Graphics Hardware Using CUDA, 2009, 2009 15th International Conference on Parallel and Distributed Systems.
[31] Peter Messmer, et al. Parallel data-locality aware stencil computations on modern micro-architectures, 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.
[32] Ken Kennedy, et al. Improving register allocation for subscripted variables, 1990, PLDI '90.
[33] Michael F. P. O'Boyle, et al. Using machine learning to focus iterative optimization, 2006, International Symposium on Code Generation and Optimization (CGO'06).
[34] Zhiyuan Li, et al. New tiling techniques to improve cache temporal locality, 1999, PLDI '99.