A review of CUDA optimization techniques and tools for structured grid computing
[1] Naoya Maruyama,et al. Optimizing Stencil Computations for NVIDIA Kepler GPUs, 2014.
[2] Benoît Meister,et al. A mapping path for multi-GPGPU accelerated computers from a portable high level programming abstraction , 2010, GPGPU-3.
[3] Fayez Gebali,et al. Algorithms and Parallel Computing, 2011.
[4] Samuel Williams,et al. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[5] Wu-chun Feng,et al. Inter-block GPU communication via fast barrier synchronization , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).
[6] Fayez Gebali,et al. Algorithms and Parallel Computing, 2011.
[7] Wen-mei W. Hwu,et al. CUDA-Lite: Reducing GPU Programming Complexity , 2008, LCPC.
[8] Mayez A. Al-Mouhamed,et al. RT-CUDA: A Software Tool for CUDA Code Restructuring , 2016, International Journal of Parallel Programming.
[9] Anders Logg,et al. Unified form language: A domain-specific language for weak formulations of partial differential equations , 2012, TOMS.
[10] Yuping Zhang,et al. Optimizing sparse matrix-vector multiplication on CUDA , 2010, 2010 2nd International Conference on Education Technology and Computer.
[11] Samuel Williams,et al. Optimization of sparse matrix-vector multiplication on emerging multicore platforms , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).
[12] Wilfred Pinfold,et al. Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, 2009, HiPC 2009.
[13] P. Sadayappan,et al. Characterizing dataset dependence for sparse matrix-vector multiplication on GPUs, 2015.
[14] Jack J. Dongarra,et al. Towards dense linear algebra for hybrid GPU accelerated manycore systems , 2009, Parallel Comput.
[15] Greg van Anders,et al. Accelerating the solution of families of shifted linear systems with CUDA, 2011.
[16] 강성호. US Patent and Trademark Office et al., 1999.
[17] Zhang Qian,et al. A new method of Sparse Matrix-Vector Multiplication on GPU , 2012, Proceedings of 2012 2nd International Conference on Computer Science and Network Technology.
[18] Michael Garland,et al. Implementing sparse matrix-vector multiplication on throughput-oriented processors , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.
[19] L. Dagum,et al. OpenMP: an industry standard API for shared-memory programming, 1998.
[20] P. Sadayappan,et al. High-performance sparse matrix-vector multiplication on GPUs for structured grid computations , 2012, GPGPU-5.
[21] Uday Bondhugula,et al. A practical automatic polyhedral parallelizer and locality optimizer , 2008, PLDI '08.
[22] Rudolf Eigenmann,et al. OpenMPC: extended OpenMP for efficient programming and tuning on GPUs , 2013, Int. J. Comput. Sci. Eng.
[23] Christoph W. Kessler,et al. SkePU 2: Flexible and Type-Safe Skeleton Programming for Heterogeneous Parallel Systems , 2018, International Journal of Parallel Programming.
[24] Zhimin Li,et al. An improved sparse matrix-vector multiplication kernel for solving modified equation in large scale power flow calculation on CUDA , 2012, Proceedings of The 7th International Power Electronics and Motion Control Conference.
[25] Jacqueline Chame,et al. A script-based autotuning compiler system to generate high-performance CUDA code , 2013, TACO.
[26] Shoaib Kamil,et al. OpenTuner: An extensible framework for program autotuning , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).
[27] Samuel Williams,et al. A Generalized Framework for Auto-tuning Stencil Computations, 2009.
[28] Peter Kilpatrick,et al. A parallel pattern for iterative stencil + reduce , 2016, The Journal of Supercomputing.
[29] Tarek S. Abdelrahman,et al. hiCUDA: High-Level GPGPU Programming , 2011, IEEE Transactions on Parallel and Distributed Systems.
[30] François Bodin,et al. Heterogeneous multicore parallel programming for graphics processing units , 2009, Sci. Program.
[31] Diego Alejandro Rivera-Polanco. Collective Communication and Barrier Synchronization on NVIDIA CUDA GPU, 2009.
[32] Bronis R. de Supinski,et al. OpenMP for Accelerators , 2011, IWOMP.
[33] Laura S. Hjerpe,et al. U.S. Patent and Trademark Office, 2003.
[34] Nazeeruddin Mohammad,et al. Optimizing the Matrix Multiplication Using Strassen and Winograd Algorithms with Limited Recursions on Many-Core , 2015, International Journal of Parallel Programming.
[35] Klaus Mueller,et al. Why do commodity graphics hardware boards (GPUs) work so well for acceleration of computed tomography? , 2007, Electronic Imaging.
[36] P. Sadayappan,et al. Stencil-Aware GPU Optimization of Iterative Solvers , 2013, SIAM J. Sci. Comput.
[37] Jack J. Dongarra,et al. Acceleration of GPU-based Krylov solvers via data transfer reduction , 2015, Int. J. High Perform. Comput. Appl.
[38] Zhaohui Du,et al. Data and computation transformations for Brook streaming applications on multiprocessors , 2006, International Symposium on Code Generation and Optimization (CGO'06).
[39] Boyana Norris,et al. Autotuning Stencil-Based Computations on GPUs , 2012, 2012 IEEE International Conference on Cluster Computing.
[40] Mayez A. Al-Mouhamed,et al. SpMV and BiCG-Stab optimization for a class of hepta-diagonal-sparse matrices on GPU , 2017, The Journal of Supercomputing.
[41] Jiaquan Gao,et al. Efficient CSR-Based Sparse Matrix-Vector Multiplication on GPU, 2016.
[42] James Demmel,et al. Benchmarking GPUs to tune dense linear algebra , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[43] Samuel Williams,et al. An auto-tuning framework for parallel multicore stencil computations , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).
[44] Pat Hanrahan,et al. Brook for GPUs: stream computing on graphics hardware , 2004, ACM Trans. Graph.
[45] Mark S. Peercy,et al. A performance-oriented data parallel virtual machine for GPUs , 2006, SIGGRAPH '06.
[46] Srinivasan Parthasarathy,et al. Automatic Selection of Sparse Matrix Representation on GPUs , 2015, ICS.
[47] Satoshi Matsuoka,et al. High performance conjugate gradient solver on multi-GPU clusters using hypergraph partitioning , 2010, Computer Science - Research and Development.