Benchmarking GPUs to tune dense linear algebra
[1] Ramesh C. Agarwal, et al. Vector and parallel algorithms for Cholesky factorization on IBM 3090, 1989, Proceedings of the 1989 ACM/IEEE Conference on Supercomputing (Supercomputing '89).
[2] Jack J. Dongarra, et al. A set of level 3 basic linear algebra subprograms, 1990, TOMS.
[3] Christian H. Bischof, et al. An Adaptive Blocking Strategy for Matrix Factorizations, 1990, CONPAR.
[4] Jack Dongarra, et al. LAPACK: a portable linear algebra library for high-performance computers, 1990, SC.
[5] Jack Dongarra, et al. LAPACK Working Note 24: LAPACK Block Factorization Algorithms on the Intel iPSC/860, 1990.
[6] Allan Porterfield, et al. The Tera computer system, 1990.
[7] Jaeyoung Choi, et al. Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines, 1994, Sci. Program.
[8] Jack Dongarra, et al. Numerical Linear Algebra for High-Performance Computers, 1998.
[9] Pat Hanrahan, et al. Understanding the efficiency of GPU algorithms for matrix-matrix multiplication, 2004, Graphics Hardware.
[10] Dinesh Manocha, et al. LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware, 2005, ACM/IEEE SC 2005 Conference (SC'05).
[11] N.K. Govindaraju, et al. A Memory Model for Scientific Algorithms on Graphics Processors, 2006, ACM/IEEE SC 2006 Conference (SC'06).
[12] Steve Scott, et al. The Cray BlackWidow: a highly scalable vector multiprocessor, 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).
[13] Bingsheng He, et al. Efficient gather and scatter operations on graphics processors, 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).
[14] Rafael Mayo, et al. Solving Dense Linear Systems on Graphics Processors, 2008, Euro-Par.
[15] Wen-mei W. Hwu, et al. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA, 2008, PPoPP.
[16] Jack Dongarra, et al. Some issues in dense linear algebra for multicore and special purpose architectures, 2008.
[17] Uday Bondhugula, et al. A compiler framework for optimization of affine loop nests for GPGPUs, 2008, ICS '08.
[18] J. Kulpa, et al. Time-frequency analysis using NVIDIA compute unified device architecture (CUDA), 2009, Symposium on Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments (WILGA).