MSPML: Communication Environments and Fault Tolerance

Some problems require performance that only massively parallel computers can offer, but programming these machines remains difficult. Work on functional programming and parallelism falls into two categories: explicitly parallel extensions of functional languages, where the resulting languages are either non-deterministic or non-functional, and parallel implementations with functional semantics, where the resulting languages cannot express parallel algorithms directly and do not allow execution times to be predicted. Algorithmic skeleton languages, in which only a finite set of operations (the skeletons) run in parallel, constitute an intermediate approach. Their functional semantics is explicit, but their parallel operational semantics is implicit. The set of algorithmic skeletons has to be as complete as possible, but it often depends on the application domain. Exploring this intermediate position further leads to the goal of universal languages in which the cost of a program can be determined from its source code. This requirement means that the locations of the processors in the machine's network must be explicit in programs. The MSPML (Minimally Synchronous Parallel ML) library was developed within this framework. It additionally offers an asynchronous evaluation semantics, a property that proves very useful for unbalanced parallel programs. Communication environments are the basis of MSPML's communication mechanism. In the first part, we discuss the implementation of a new mechanism for managing these environments, a mechanism that will solve the problems related to asynchrony depth, in particular for locally unbalanced programs. In the second part, we study how MSPML could be equipped with a fault-tolerance mechanism, an aspect that is essential today for parallel and distributed systems. In doing so, we seek to make MSPML more reliable and more available.
