A Component Architecture for the Message Passing Interface (MPI): The System Services Interface (SSI) of LAM/MPI

This work presents the design and implementation of a component system architecture in LAM/MPI, a production-quality, open source implementation of the MPI-1 and MPI-2 standards. Previous versions of LAM/MPI, as well as other MPI implementations, are based on monolithic software architectures. However well abstracted and logically constructed, they are highly complex software packages that present a steep learning curve to new developers and third parties. As a result, researchers in parallel computing face enormous logistical and technical difficulties when using or adapting existing implementations for their own work. Existing code bases are typically locked into highly specific implementation models, which effectively prevents extensions that do not already conform to those models, and the time required to train a researcher in a complex software system can be prohibitive. To address these issues, the current version of LAM/MPI has been re-architected around a component system architecture consisting of four component frameworks and a meta framework that ties them together. Each component framework was designed from an analysis of prior monolithic implementations of LAM/MPI and represents a major functional category: run-time environment startup, MPI point-to-point communication, MPI collective communication, and parallel checkpoint/restart. The result is an MPI implementation that is highly modular, has published abstraction and interface boundaries, and is significantly easier to develop, maintain, and use as a vehicle for research. Performance results show that this component-based approach provides identical (if not better) performance compared to prior monolithic implementations.
