Paper information - A Component Architecture for the Message Passing Interface (MPI): The Systems Services Interface (SSI) of LAM/MPI

This work presents the design and implementation of a component system architecture in LAM/MPI, a production-quality, open-source implementation of the MPI-1 and MPI-2 standards. Previous versions of LAM/MPI, as well as other MPI implementations, are based on monolithic software architectures that, regardless of how well abstracted and logically constructed, are highly complex software packages that present a steep learning curve for new developers and third parties. As a result, parallel researchers face enormous logistical and technical difficulties when using or adapting existing implementations for their own work. Not only are existing code bases typically locked into highly specific implementation models (effectively preventing extensions that do not already conform to those models), but the time investment required to train a researcher in a complex software system can be prohibitive. To address these issues, the current version of LAM/MPI has been re-architected to use a component system architecture consisting of four component frameworks and a meta framework that ties them together. Each component framework was designed from analysis of prior monolithic implementations of LAM/MPI and represents a major functional category: run-time environment startup, MPI point-to-point communication, MPI collective communication, and parallel checkpoint/restart. The result is an MPI implementation that is highly modular, has published abstraction and interface boundaries, and is significantly easier to develop, maintain, and use as a vehicle for research. Performance results demonstrate that this component-based approach provides identical (if not better) performance compared to prior monolithic implementations.

Jeffrey M. Squyres | Andrew Lumsdaine
