Novel HPC techniques to batch execution of many variable size BLAS computations on GPUs
Jack J. Dongarra | Stanimire Tomov | Azzam Haidar | Ahmad Abdelfattah
[1] Antonino Tumeo, et al. Accelerating subsurface transport simulation on heterogeneous clusters, 2013, 2013 IEEE International Conference on Cluster Computing (CLUSTER).
[2] David E. Bernholdt, et al. Automatic code generation for many-body electronic structure methods: the tensor contraction engine, 2006.
[3] Jack J. Dongarra, et al. Performance, Design, and Autotuning of Batched GEMM for GPUs, 2016, ISC.
[4] Jack J. Dongarra, et al. Autotuning GEMM Kernels for the Fermi GPU, 2012, IEEE Transactions on Parallel and Distributed Systems.
[5] James Demmel, et al. LU, QR and Cholesky Factorizations using Vector Capabilities of GPUs, 2008.
[6] Jack J. Dongarra, et al. Towards batched linear solvers on accelerated hardware platforms, 2015, PPOPP.
[7] André Seznec, et al. Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs, 2013, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[8] Merek A. Chertkow, et al. Multicore and Accelerator Development for a Leadership-Class Stellar Astrophysics Code, 2012, PARA.
[9] Chetan Jhurani, et al. A GEMM interface and implementation on NVIDIA GPUs for multiple small matrices, 2013, J. Parallel Distributed Comput.
[10] Massimiliano Fatica, et al. Power/Performance Trade-Offs of Small Batched LU Based Solvers on GPUs, 2013, Euro-Par.
[11] Ninghui Sun, et al. Fast implementation of DGEMM on Fermi GPU, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[12] Jack J. Dongarra, et al. Experiences in autotuning matrix multiplication for energy minimization on GPUs, 2015, Concurr. Comput. Pract. Exp.
[13] Jack J. Dongarra, et al. An Improved Magma Gemm For Fermi Graphics Processing Units, 2010, Int. J. High Perform. Comput. Appl.
[14] Jack J. Dongarra, et al. Batched matrix computations on hardware accelerators based on GPUs, 2015, Int. J. High Perform. Comput. Appl.
[15] Endong Wang, et al. Intel Math Kernel Library, 2014.
[16] Timothy A. Davis, et al. Algorithm 9xx: Sparse QR Factorization on the GPU, 2015.
[17] Jack J. Dongarra, et al. Dense linear algebra solvers for multicore with GPU accelerators, 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW).
[18] Jack J. Dongarra, et al. A Note on Auto-tuning GEMM for GPUs, 2009, ICCS.
[19] James Demmel, et al. Benchmarking GPUs to tune dense linear algebra, 2008, HiPC 2008.
[20] Jack J. Dongarra, et al. Implementation and Tuning of Batched Cholesky Factorization and Solve for NVIDIA GPUs, 2016, IEEE Transactions on Parallel and Distributed Systems.
[21] Jack J. Dongarra, et al. A Framework for Batched and GPU-Resident Factorization Algorithms Applied to Block Householder Transformations, 2015, ISC.
[22] David E. Keyes, et al. Redesigning Triangular Dense Matrix Computations on GPUs, 2016, Euro-Par.
[23] Kurt Keutzer, et al. A Predictive Model for Solving Small Linear Algebra Problems in GPU Registers, 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.