Accelerating Scientific Applications using High Performance Dense and Sparse Linear Algebra Kernels on GPUs
[1] David E. Keyes, et al. Pipelining Computational Stages of the Tomographic Reconstructor for Multi-Object Adaptive Optics on a Multi-GPU System, 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.
[2] James Demmel, et al. ScaLAPACK: A Portable Linear Algebra Library for Distributed Memory Computers - Design Issues and Performance, 1995, PARA.
[3] Bertil Schmidt, et al. CUDA-enabled Sparse Matrix-Vector Multiplication on GPUs using atomic operations, 2013, Parallel Comput.
[4] P. Sadayappan, et al. High-performance sparse matrix-vector multiplication on GPUs for structured grid computations, 2012, GPGPU-5.
[5] Jack J. Dongarra, et al. From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming, 2012, Parallel Comput.
[6] Srinivasan Parthasarathy, et al. Fast Sparse Matrix-Vector Multiplication on GPUs for Graph Applications, 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.
[7] Gerhard Wellein, et al. A unified sparse matrix data format for modern processors with wide SIMD units, 2013, ArXiv.
[8] P. Sadayappan, et al. An efficient two-dimensional blocking strategy for sparse matrix-vector multiplication on GPUs, 2014, ICS '14.
[9] Jack J. Dongarra, et al. Autotuning GEMM Kernels for the Fermi GPU, 2012, IEEE Transactions on Parallel and Distributed Systems.
[10] Gerhard Wellein, et al. Sparse Matrix-vector Multiplication on GPGPU Clusters: A New Storage Format and a Scalable Implementation, 2011, 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum.
[11] Samuel Williams, et al. The Landscape of Parallel Computing Research: A View from Berkeley, 2006.
[12] David E. Keyes, et al. Optimizing Memory-Bound SYMV Kernel on GPU Hardware Accelerators, 2012, VECPAR.
[13] Rajesh Bordawekar, et al. Optimizing Sparse Matrix-Vector Multiplication on GPUs, 2009.
[14] Jack Dongarra, et al. Implementing a Sparse Matrix Vector Product for the SELL-C/SELL-C-sigma formats on NVIDIA GPUs, 2014.
[15] Ester M. Garzón, et al. Improving the Performance of the Sparse Matrix Vector Product with GPUs, 2010, 2010 10th IEEE International Conference on Computer and Information Technology.
[16] Samuel Williams, et al. Roofline: an insightful visual performance model for multicore architectures, 2009, CACM.
[17] Stefan Turek, et al. Towards a complete FEM-based simulation toolkit on GPUs: Unstructured grid finite element geometric multigrid solvers with strong smoothers based on sparse approximate inverses, 2013.
[18] Michael Garland, et al. Implementing sparse matrix-vector multiplication on throughput-oriented processors, 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.
[19] Jack J. Dongarra, et al. Tridiagonalization of a Symmetric Dense Matrix on a GPU Cluster, 2013, 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum.
[20] Eric J. Kelmelis, et al. CULA: hybrid GPU accelerated linear algebra routines, 2010, Defense + Commercial Sensing.
[21] Y. Saad, et al. GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems, 1986.
[22] J. Cuby, et al. ELT-MOS White Paper: Science Overview & Requirements, 2013, 1303.0029.
[23] Matthew G. Knepley, et al. Preliminary Implementation of PETSc Using GPUs, 2013.
[24] Richard W. Vuduc, et al. Sparsity: Optimization Framework for Sparse Matrix Kernels, 2004, Int. J. High Perform. Comput. Appl.
[25] Jack Dongarra, et al. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects, 2009.
[26] Bilel Hadri, et al. The Automatic Library Tracking Database, 2010.
[27] J. Krüger, et al. Linear algebra operators for GPU implementation of numerical algorithms, 2003, ACM Trans. Graph.
[28] Pat Hanrahan, et al. Brook for GPUs: stream computing on graphics hardware, 2004, SIGGRAPH 2004.
[29] Katherine A. Yelick, et al. Optimizing Sparse Matrix Computations for Register Reuse in SPARSITY, 2001, International Conference on Computational Science.
[30] David E. Keyes, et al. Systematic Approach in Optimizing Numerical Memory-Bound Kernels on GPU, 2012, Euro-Par Workshops.
[31] Thomas C. Oppe, et al. ITPACKV 2D user's guide, 1989.
[32] Robert A. van de Geijn, et al. The libflame Library for Dense Matrix Computations, 2009, Computing in Science & Engineering.
[33] Jack J. Dongarra, et al. Optimizing symmetric dense matrix-vector multiplication on GPUs, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[34] Jack Dongarra, et al. Preliminary results of autotuning GEMM kernels for the NVIDIA Kepler architecture-GeForce GTX 680, 2012.
[35] Jack J. Dongarra, et al. Accelerating GPU Kernels for Dense Linear Algebra, 2010, VECPAR.
[36] David E. Keyes, et al. High Performance Pseudo-analytical Simulation of Multi-Object Adaptive Optics over Multi-GPU Systems, 2014, Euro-Par.
[37] Wolfgang Hackbusch, et al. A Sparse Matrix Arithmetic Based on H-Matrices. Part I: Introduction to H-Matrices, 1999, Computing.
[38] Bilel Hadri, et al. Software Usage on Cray Systems across Three Centers (NICS, ORNL and CSCS), 2012.
[39] Jie Cheng, et al. Programming Massively Parallel Processors. A Hands-on Approach, 2010, Scalable Comput. Pract. Exp.
[40] Jack J. Dongarra, et al. Tridiagonalization of a dense symmetric matrix on multiple GPUs and its application to symmetric eigenvalue problems, 2014, Concurr. Comput. Pract. Exp.
[41] Francisco Vázquez, et al. A new approach for sparse matrix vector product on NVIDIA GPUs, 2011, Concurr. Comput. Pract. Exp.
[42] James Demmel, et al. The Parallel Computing Landscape, 2022.
[43] Jack J. Dongarra, et al. A Note on Auto-tuning GEMM for GPUs, 2009, ICCS.
[44] James Demmel, et al. Benchmarking GPUs to tune dense linear algebra, 2008, HiPC 2008.
[45] M. Hestenes, et al. Methods of conjugate gradients for solving linear systems, 1952.
[46] Cédric Augonnet, et al. StarPU: a unified platform for task scheduling on heterogeneous multicore architectures, 2011, Concurr. Comput. Pract. Exp.
[47] Richard W. Vuduc, et al. Model-driven autotuning of sparse matrix-vector multiply on GPUs, 2010, PPoPP '10.
[48] Ninghui Sun, et al. Fast implementation of DGEMM on Fermi GPU, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[49] A. Sevin, et al. A novel fast and accurate pseudo-analytical simulation approach for MOAO, 2014, Astronomical Telescopes and Instrumentation.
[50] David E. Keyes, et al. KBLAS: An Optimized Library for Dense Matrix-Vector Multiplication on GPU Accelerators, 2014, ACM Trans. Math. Softw.
[51] Arutyun Avetisyan, et al. Automatically Tuning Sparse Matrix-Vector Multiplication for GPU Architectures, 2010, HiPEAC.
[52] Joseph L. Greathouse, et al. Efficient Sparse Matrix-Vector Multiplication on GPUs Using the CSR Storage Format, 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.
[53] Timothy A. Davis, et al. The University of Florida sparse matrix collection, 2011, TOMS.
[54] Shengen Yan, et al. yaSpMV: yet another SpMV framework on GPUs, 2014, PPoPP.
[55] Jack J. Dongarra, et al. Accelerating the reduction to upper Hessenberg, tridiagonal, and bidiagonal forms through hybrid GPU-based computing, 2010, Parallel Comput.
[56] Jack J. Dongarra, et al. An Improved MAGMA GEMM for Fermi Graphics Processing Units, 2010, Int. J. High Perform. Comput. Appl.