An execution service for a partitionable low bandwidth network

As the amount of scientific data grows to the point where Internet bandwidth can no longer support its transfer, it becomes necessary to make powerful computational services available near data repositories. Such services allow remote researchers to start long-running parallel computations on the data. Current execution services do not provide remote users with adequate management facilities for this style of computing. This paper describes PEX, an execution service whose architecture is based on partitionable group communication. We describe how PEX maintains replicated state in the face of processor failures and network partitions, and how it allows remote clients to manipulate this state. We present performance measurements and close by discussing related work.
