Cache-Oblivious MPI All-to-All Communications Based on Morton Order
[1] Ying Qian, et al. Efficient shared memory and RDMA based collectives on multi-rail QsNetII SMP clusters, 2008, Cluster Computing.
[2] Eli Upfal, et al. Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems, 1997, IEEE Trans. Parallel Distributed Syst.
[3] Rajeev Thakur, et al. Optimization of Collective Communication Operations in MPICH, 2005, Int. J. High Perform. Comput. Appl.
[4] Dhabaleswar K. Panda, et al. Virtual machine aware communication libraries for high performance computing, 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).
[5] Guy E. Blelloch, et al. The data locality of work stealing, 2000, SPAA.
[6] Joel H. Ferziger, et al. Computational methods for fluid dynamics, 1996.
[7] Brice Goglin, et al. KNEM: A generic and scalable kernel-assisted intra-node MPI communication framework, 2013, J. Parallel Distributed Comput.
[8] Hubert Ritzdorf, et al. The scalable process topology interface of MPI 2.2, 2011, Concurr. Comput. Pract. Exp.
[9] Sayantan Sur, et al. Lightweight kernel-level primitives for high-performance MPI intra-node communication over multi-core systems, 2007, 2007 IEEE International Conference on Cluster Computing.
[10] Michael Bader, et al. Hardware-Oriented Implementation of Cache Oblivious Matrix Operations Based on Space-Filling Curves, 2007, PPAM.
[11] Jesús Carretero, et al. Data Locality Aware Strategy for Two-Phase Collective I/O, 2008, VECPAR.
[12] Guy E. Blelloch, et al. Low depth cache-oblivious algorithms, 2010, SPAA '10.
[13] Jorge González-Domínguez, et al. Scalable PGAS collective operations in NUMA clusters, 2014, Cluster Computing.
[14] Tao Yang, et al. Optimizing threaded MPI execution on SMP clusters, 2001, ICS '01.
[15] Jeremy D. Frens, et al. QR factorization with Morton-ordered quadtree matrices for memory re-use and parallelism, 2003, PPoPP '03.
[16] John M. Levesque, et al. Practical performance portability in the Parallel Ocean Program (POP), 2005, Concurr. Pract. Exp.
[17] Torsten Hoefler, et al. Hybrid MPI: Efficient message passing for multi-core systems, 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[18] Volker Strumpen, et al. The Cache Complexity of Multithreaded Cache Oblivious Algorithms, 2009, SPAA '06.
[19] Torsten Hoefler, et al. NUMA-aware shared-memory collective communication for MPI, 2013, HPDC.
[20] Francois Gygi. Large-scale first-principles molecular dynamics: moving from terascale to petascale computing, 2006.
[21] A. N. Yzelman, et al. A Cache-Oblivious Sparse Matrix–Vector Multiplication Scheme Based on the Hilbert Curve, 2012.
[22] Sophie Papst, et al. Computational Methods For Fluid Dynamics, 2016.
[23] Robert D. Blumofe, et al. Scheduling multithreaded computations by work stealing, 1994, Proceedings 35th Annual Symposium on Foundations of Computer Science.
[24] Message Passing Interface Forum. MPI: A message-passing interface standard, 1994.
[25] Charles E. Leiserson, et al. Cache-Oblivious Algorithms, 2003, CIAC.
[26] Siddhartha Chatterjee, et al. Cache-efficient matrix transposition, 2000, Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550).
[27] Sabela Ramos, et al. Modeling communication in cache-coherent SMP systems: a case-study with Xeon Phi, 2013, HPDC.
[28] Baodong Wu, et al. Hybrid-optimization strategy for the communication of large-scale Kinetic Monte Carlo simulation, 2017, Comput. Phys. Commun.
[29] Torsten Hoefler, et al. Improved MPI collectives for MPI processes in shared address spaces, 2014, Cluster Computing.
[30] Michael Woodacre. The SGI® Altix 3000 Global Shared-Memory Architecture, 2003.
[31] David H. Bailey, et al. The NAS Parallel Benchmarks, 1991, Int. J. High Perform. Comput. Appl.
[32] Amith R. Mamidala, et al. Efficient Shared Memory and RDMA Based Design for MPI_Allgather over InfiniBand, 2006, PVM/MPI.
[33] Bronis R. de Supinski, et al. Exploiting hierarchy in parallel computer networks to optimize collective operation performance, 2000, Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000.
[34] George Bosilca, et al. HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware Collective Communications on Many-core Clusters, 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.
[35] Maria Ganzha, et al. Utilizing Recursive Storage in Sparse Matrix-Vector Multiplication - Preliminary Considerations, 2010, CATA.