Supporting Dynamic Space-sharing on Clusters of Non-dedicated Workstations

Clusters of workstations are increasingly viewed as a cost-effective alternative to parallel supercomputers. However, resource management and scheduling on workstation clusters are complicated by the fact that the number of idle workstations available for executing parallel applications fluctuates constantly. In this paper, we present a case for scheduling parallel applications on non-dedicated workstation clusters using dynamic space-sharing, a policy under which the number of processors allocated to an application can be changed during its execution. We describe an approach that uses application-level checkpointing and data repartitioning to support dynamic space-sharing and to handle the dynamic reconfiguration triggered when a failure or owner activity is detected on a workstation being used by a parallel application. The performance advantages of dynamic space-sharing are quantified through a simulation study, and experimental results are presented for the overhead of dynamically reconfiguring a grid-oriented data-parallel application using our approach.
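
The data repartitioning step mentioned above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a grid whose rows are block-distributed across processors, that the processors remaining after a reconfiguration keep ranks 0..q-1, and the helper names `block_bounds` and `repartition_plan` are hypothetical. It only computes which row ranges would have to move when the processor count changes.

```python
def block_bounds(n_rows, n_procs):
    """Return half-open (start, end) row ranges, one per processor rank."""
    base, extra = divmod(n_rows, n_procs)
    bounds, start = [], 0
    for rank in range(n_procs):
        # The first `extra` ranks receive one additional row.
        size = base + (1 if rank < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def repartition_plan(n_rows, old_procs, new_procs):
    """List (src_rank, dst_rank, start, end) row transfers needed to go
    from the old block layout to the new one.

    Assumption (not from the paper): ranks are stable across the
    reconfiguration, so rows that stay on the same rank need no transfer.
    """
    old = block_bounds(n_rows, old_procs)
    new = block_bounds(n_rows, new_procs)
    plan = []
    for src, (o_start, o_end) in enumerate(old):
        for dst, (n_start, n_end) in enumerate(new):
            lo, hi = max(o_start, n_start), min(o_end, n_end)
            if lo < hi and src != dst:
                plan.append((src, dst, lo, hi))
    return plan


if __name__ == "__main__":
    # Shrink from 4 to 3 processors, e.g. after an owner reclaims a workstation.
    for transfer in repartition_plan(10, 4, 3):
        print(transfer)
```

In a real reconfiguration the rows held by a departing workstation would come from its last checkpoint rather than from the live process; the sketch ignores that distinction and only shows the bookkeeping.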

[1]  Sanjeev Setia,et al.  Supporting dynamic space-sharing on clusters of non-dedicated workstations , 1997, Proceedings of 17th International Conference on Distributed Computing Systems.
