Performance Analysis of a Hybrid MPI / CUDA Implementation of the NAS-LU Benchmark
[1] Gabriel Zachmann, et al. GPU-ABiSort: optimal parallel sorting on stream architectures, 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.
[2] Joseph T. Kider, et al. All-pairs shortest-paths for large graphs on the GPU, 2008, GH '08.
[3] Suzanne M. Kelly, et al. Summary of multi-core hardware and programming model investigations, 2008.
[4] Jing Xie, et al. Optimizing Sweep3D for Graphic Processor Unit, 2010, ICA3PP.
[5] Klaus Schulten, et al. GPU acceleration of cutoff pair potentials for molecular modeling applications, 2008, CF '08.
[6] Ralf H. Reussner, et al. SKaMPI: A Detailed, Accurate MPI Benchmark, 1998, PVM/MPI.
[7] P. J. Narayanan, et al. Accelerating Large Graph Algorithms on the GPU Using CUDA, 2007, HiPC.
[8] Bandwidth intensive 3-D FFT kernel for GPUs using CUDA, 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[9] Uday Bondhugula, et al. Can CPUs Match GPUs on Performance with Productivity?: Experiences with Optimizing a FLOP-intensive Application on CPUs and GPU, 2010.
[10] Wen-mei W. Hwu, et al. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA, 2008, PPoPP.
[11] Yao Zhang, et al. Fast tridiagonal solvers on the GPU, 2010, PPoPP '10.
[12] Inanc Senocak, et al. An MPI-CUDA Implementation for Massively Parallel Incompressible Flow Computations on Multi-GPU Clusters, 2010.
[13] Murat Efe Guney, et al. On the limits of GPU acceleration, 2010.
[14] Burton J. Smith, et al. High performance discrete Fourier transforms on graphics processors, 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[15] Fumihiko Ino, et al. Design and implementation of the Smith-Waterman algorithm on the CUDA-compatible GPU, 2008, 2008 8th IEEE International Conference on BioInformatics and BioEngineering.
[16] Uday Bondhugula, et al. Believe it or not! multi-core CPUs can match GPU performance for a FLOP-intensive application!, 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).
[17] Fabrizio Petrini, et al. A general predictive performance model for wavefront algorithms on clusters of SMPs, 2000, Proceedings 2000 International Conference on Parallel Processing.
[18] Kevin Skadron, et al. Accelerating leukocyte tracking using CUDA: A case study in leveraging manycore coprocessors, 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.
[19] Mary K. Vernon, et al. A plug-and-play model for evaluating wavefront computations on parallel architectures, 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.
[20] David H. Bailey, et al. The NAS Parallel Benchmarks, 1991, Int. J. High Perform. Comput. Appl.
[21] Giorgio Valle, et al. CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment, 2008, BMC Bioinformatics.
[22] J. A. Smith, et al. WARPP: a toolkit for simulating high-performance parallel scientific codes, 2009, SimuTools.
[23] Satoshi Matsuoka, et al. An 80-Fold Speedup, 15.0 TFlops Full GPU Acceleration of Non-Hydrostatic Weather Model ASUCA Production Code, 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[24] Leslie Lamport, et al. The parallel execution of DO loops, 1974, CACM.
[25] Jack J. Dongarra, et al. A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators, 2010, VECPAR.
[26] Fabrizio Petrini, et al. Multicore Surprises: Lessons Learned from Optimizing Sweep3D on the Cell Broadband Engine, 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.
[27] Pradeep Dubey, et al. Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU, 2010, ISCA.
[28] Ezequiel Herruzo, et al. A New Parallel Sorting Algorithm based on Odd-Even Mergesort, 2007, 15th EUROMICRO International Conference on Parallel, Distributed and Network-Based Processing (PDP'07).
[29] Maurice Yarrow, et al. Communication Improvement for the LU NAS Parallel Benchmark: A Model for Efficient Parallel Relaxation Schemes, 1997.