One-sided communication for high performance computing applications

Parallel programming presents a number of critical challenges to application developers. Traditionally, parallel applications have been programmed with message passing, in which one process explicitly sends data and another explicitly receives it. With the recent growth in multi-core processors, the level of parallelism required by next-generation machines is a cause for concern in the message passing community. The one-sided programming paradigm, in which only one of the two processes involved in communication actively participates in the message transfer, has seen increased interest as a potential replacement for message passing. One-sided communication does not carry the heavy per-message overhead associated with modern message passing libraries. The paradigm also offers lower synchronization costs and advanced data manipulation techniques, such as remote atomic arithmetic and synchronization operations. Together, these features present an appealing interface for applications with random communication patterns, which have traditionally been difficult for message passing implementations to handle. This thesis presents a taxonomy of the one-sided paradigm and a taxonomy of the applications that are ideal for the one-sided interface. Three case studies, based on real-world applications, are used to motivate both taxonomies and to verify the applicability of the MPI one-sided communication and Cray SHMEM one-sided interfaces to real-world problems. While our results show a number of shortcomings in existing implementations, they also suggest that a number of applications could benefit from the one-sided paradigm. Finally, an implementation of the MPI one-sided interface within Open MPI is presented, which provides a number of unique performance features necessary for efficient use of the one-sided programming paradigm.
