Just-in-time Transparent Resource Management in Distributed Systems

This paper presents the design and implementation of a resource management system that monitors computing resources on a network and dynamically allocates them to concurrently executing jobs. In particular, it is designed to support adaptive parallel computations: computations that benefit from the addition of new machines and can tolerate the removal of machines while executing. The challenge for such a resource manager is to communicate the availability of resources to running programs even when those programs were not developed to work with external resource managers. Our main contribution is a novel mechanism addressing this issue, built on low-level features common to popular parallel programming systems. Existing resource management systems for adaptive computations require either tight integration with the operating system (DRMS) or integration with a programming system that is aware of external resource managers (e.g., Condor/CARMI, MPVM, Piranha). Thus, in each case, support is limited to a single type of programming system. In contrast, our resource management system is unique in supporting several unmodified parallel programming systems. Furthermore, the system runs with user-level privilege and thus cannot compromise the security of the network. The underlying mechanism and the overall system have been validated on a dynamically changing mix of jobs: some sequential, some PVM, some MPI, and some Calypso computations. We demonstrate the feasibility and usefulness of our approach, showing how to construct a middleware resource management system that enhances the utilization of distributed systems.
