MPI Farm Programs on Non-dedicated Clusters

MPI has been extremely successful. In areas such as particle physics, most of the available parallel programs are based on MPI. Unfortunately, these programs must be run on dedicated clusters or parallel machines, so long-running applications cannot exploit the growing pool of idle time on general-purpose desktop computers. Additionally, MPI offers a fairly low-level interface, which is hard to use for most scientist programmers. In the research described in this paper, we tried to see how far we could go in solving those two problems while keeping the portability of MPI programs, under one restriction: only programs following the FARM paradigm were to be supported. The library we developed, MpiFL, gave us significant insight. It is now being used successfully at the physics department of the University of Coimbra, despite some shortcomings.
