Leveraging non-blocking collective communication in high-performance applications
[1] G. Liu,et al. Overlap of Computation and Communication on Shared-Memory , 1999, Scalable Comput. Pract. Exp..
[2] Torsten Hoefler,et al. Optimizing a conjugate gradient solver with non-blocking collective operations , 2007, Parallel Comput..
[3] Jason Duell,et al. An evaluation of current high-performance networks , 2003, Proceedings International Parallel and Distributed Processing Symposium.
[4] Katherine A. Yelick,et al. A performance analysis of the Berkeley UPC compiler , 2003, ICS '03.
[5] F. Petrini,et al. The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q , 2003, ACM/IEEE SC 2003 Conference (SC'03).
[6] Torsten Hoefler,et al. Optimizing a conjugate gradient solver with non-blocking collective operations , 2006, Parallel Comput..
[7] Chris J. Scheiman,et al. LogGP: incorporating long messages into the LogP model—one step closer towards a realistic model for parallel computation , 1995, SPAA '95.
[8] Torsten Hoefler,et al. Low-Overhead LogGP Parameter Assessment for Modern Interconnection Networks , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.
[9] Andrea C. Arpaci-Dusseau,et al. Parallel programming in Split-C , 1993, Supercomputing '93. Proceedings.
[10] Rice University,et al. High performance Fortran language specification , 1993 .
[11] Amith R. Mamidala,et al. Fast and scalable MPI-level broadcast using InfiniBand's hardware multicast support , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings..
[12] Darren J. Kerbyson,et al. MPI tools and performance studies - Quantifying the potential benefit of overlapping communication and computation in large-scale scientific applications , 2006, SC.
[13] Ken Kennedy,et al. Compiler optimizations for Fortran D on MIMD distributed-memory machines , 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).
[14] Torsten Hoefler,et al. A Case for Standard Non-blocking Collective Operations , 2007, PVM/MPI.
[15] Torsten Hoefler,et al. Adding Low-Cost Hardware Barrier Support to Small Commodity Clusters , 2006, ARCS Workshops.
[16] Steven G. Johnson,et al. The Design and Implementation of FFTW3 , 2005, Proceedings of the IEEE.
[17] Torsten Hoefler,et al. Implementation and performance analysis of non-blocking collective operations for MPI , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).
[18] Mark J. Clement,et al. Overlapping Computations, Communications and I/O in parallel Sorting , 1995, J. Parallel Distributed Comput..
[19] Rossen Dimitrov,et al. Overlapping of Communication and Computation and Early Binding: Fundamental Mechanisms for Improving , 2001 .
[20] Denis Caromel,et al. Optimizing Metacomputing with Communication-Computation Overlap , 2001, PaCT.
[21] Jack J. Dongarra,et al. Performance Study of LU Factorization with Low Communication Overhead on Multiprocessors , 1995, Parallel Process. Lett..
[22] Christophe Calvin,et al. Minimizing Communication Overhead Using Pipelining for Multi-Dimensional FFT on Distributed Memory Machines , 1993, PARCO.
[23] Anshu Dubey,et al. Redistribution strategies for portable parallel FFT: a case study , 2001, Concurr. Comput. Pract. Exp..
[24] T. von Eicken,et al. Parallel programming in Split-C , 1993, Supercomputing '93.
[25] Katherine A. Yelick,et al. Optimizing bandwidth limited problems using one-sided communication and overlap , 2005, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.
[26] David A. Patterson,et al. Computer Architecture: A Quantitative Approach , 1969 .
[27] Susan Coghlan,et al. The Influence of Operating Systems on the Performance of Collective Operations at Extreme Scale , 2006, 2006 IEEE International Conference on Cluster Computing.
[28] Stefan Goedecker,et al. An efficient 3-dim FFT for plane wave electronic structure calculations on massively parallel machines composed of multiprocessor nodes , 2003 .
[29] Tarek S. Abdelrahman,et al. Computation-Communication Overlap on Network-of-Workstation Multiprocessors , 2001 .
[30] George Bosilca,et al. Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation , 2004, PVM/MPI.
[31] Chung-Ta King,et al. Pipelined Data Parallel Algorithms-I: Concept and Modeling , 1990, IEEE Trans. Parallel Distributed Syst..
[32] J.C. Sancho,et al. Quantifying the Potential Benefit of Overlapping Communication and Computation in Large-Scale Scientific Applications , 2006, ACM/IEEE SC 2006 Conference (SC'06).
[33] Jonathan W. Berry,et al. Challenges in Parallel Graph Processing , 2007, Parallel Process. Lett..
[34] Torsten Hoefler,et al. Optimizing non-blocking collective operations for infiniband , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.
[35] Sergei Gorlatch,et al. Send-receive considered harmful: Myths and realities of message passing , 2004, TOPL.
[36] Laxmikant V. Kalé,et al. A framework for collective personalized communication , 2003, Proceedings International Parallel and Distributed Processing Symposium.
[37] James A. Storer,et al. Parallel algorithms for data compression , 1985, JACM.
[38] J. White,et al. An Analysis of Popular MPI Implementations .
[39] Keith D. Underwood,et al. Implications of application usage characteristics for collective communication offload , 2006, Int. J. High Perform. Comput. Netw..
[40] D. Martin Swany,et al. Transformations to Parallel Codes for Communication-Computation Overlap , 2005, ACM/IEEE SC 2005 Conference (SC'05).
[41] Jack Dongarra,et al. Tiling on systems with communication/computation overlap , 1999 .
[42] Torsten Hoefler,et al. A Case for Non-blocking Collective Operations , 2006, ISPA Workshops.