Wide-area Nile: a case study of a wide-area data-parallel application

The Nile system is a distributed environment for running very large, data-intensive applications across a network of commodity workstations. These applications process data from elementary particle collisions generated by the Cornell Electron Storage Ring and are used by physicists of the CLEO experiment. Because the applications have a simple data-parallel structure, Nile executes them using as much parallelism as is available. Nile currently runs within a single site. It is being used by alpha testers and is scheduled for beta release in March 1998. We describe how we are adapting this local-area Nile system to support wide-area, multi-site interactions. In particular, we consider two problems: scaling and fault tolerance.
