Improving Communication Performance and Scalability of Native Applications on Intel Xeon Phi Coprocessor Clusters
Pradeep Dubey | Kiran Pamnany | Alexander Heinecke | Karthikeyan Vaidyanathan | Mikhail Smelyanskiy | Aniruddha G. Shet | Daehyun Kim | Jongsoo Park | Bharat Kaul | Dhiraj D. Kalamkar | Bálint Joó
[1] M. Hestenes, et al. Methods of conjugate gradients for solving linear systems, 1952.
[2] Pradeep Dubey, et al. Design and Implementation of the Linpack Benchmark for Single and Multi-node Systems Based on Intel® Xeon Phi Coprocessor, 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[3] Torsten Hoefler, et al. Parallel Zero-Copy Algorithms for Fast Fourier Transform and Conjugate Gradient Using MPI Datatypes, 2010, EuroMPI.
[4] Fred G. Gustavson, et al. Recursion leads to automatic variable blocking for dense linear-algebra algorithms, 1997, IBM J. Res. Dev.
[5] J. Tukey, et al. An algorithm for the machine calculation of complex Fourier series, 1965.
[6] Surendra Byna, et al. Improving the performance of MPI derived datatypes by optimizing memory-access cost, 2003, 2003 Proceedings IEEE International Conference on Cluster Computing.
[7] Dhabaleswar K. Panda, et al. MVAPICH-PRISM: A proxy-based communication framework using InfiniBand and SCIF for Intel MIC clusters, 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[8] Surendra Byna, et al. Automatic Memory Optimizations for Improving MPI Derived Datatype Performance, 2006, PVM/MPI.
[9] Ping Tak Peter Tang, et al. A framework for low-communication 1-D FFT, 2012, HiPC 2012.
[10] William J. Dally, et al. Principles and Practices of Interconnection Networks, 2004.
[11] Bálint Joó, et al. Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics, 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[12] Pradeep Dubey, et al. Tera-scale 1D FFT with low-communication algorithm and Intel® Xeon Phi™ coprocessors, 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[13] Ping Tak Peter Tang, et al. A framework for low-communication 1-D FFT, 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[14] Robert G. Edwards, et al. The Chroma Software System for Lattice QCD, 2004.
[15] Dhabaleswar K. Panda, et al. Zero-Copy MPI Derived Datatype Communication over InfiniBand, 2004, PVM/MPI.
[16] Victor W. Lee, et al. Lattice QCD on Intel Xeon Phi, 2013.
[17] Peter A. Boyle, et al. The BlueGene/Q supercomputer, 2012.