Lessons learned from contrasting a BLAS kernel implementations
暂无分享,去创建一个
[1] Arie E. Kaufman,et al. GPU Cluster for High Performance Computing , 2004, Proceedings of the ACM/IEEE SC2004 Conference.
[2] Jack J. Dongarra,et al. Dense linear algebra solvers for multicore with GPU accelerators , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW).
[3] Charles L. Lawson,et al. Basic Linear Algebra Subprograms for Fortran Usage , 1979, TOMS.
[4] J. Demmel,et al. Sun Microsystems , 1996 .
[5] Pradeep Dubey,et al. Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU , 2010, ISCA.
[6] Gene H. Golub,et al. Matrix computations (3rd ed.) , 1996 .
[7] John D. Davis,et al. BLAS Comparison on FPGA, CPU and GPU , 2010, 2010 IEEE Computer Society Annual Symposium on VLSI.