Paper Information - Just-in-time Transparent Resource Management in Distributed Systems

Just-in-time Transparent Resource Management in Distributed Systems

This paper presents the design and implementation of a resource management system for monitoring computing resources on a network and for dynamically allocating them to concurrently executing jobs. In particular, it is designed to support adaptive parallel computations, that is, computations that benefit from the addition of new machines and can tolerate the removal of machines while executing. The challenge for such a resource manager is to communicate the availability of resources to running programs even when the programs were not developed to work with external resource managers. Our main contribution is a novel mechanism addressing this issue, built on low-level features common to popular parallel programming systems. Existing resource management systems for adaptive computations either require tight integration with the operating system (DRMS) or require integration with a programming system that is aware of external resource managers (e.g., Condor/CARMI, MPVM, Piranha). Thus, in each case, their support is limited to a single type of programming system. In contrast, our resource management system is unique in supporting several unmodified parallel programming systems. Furthermore, the system runs with user-level privilege, and thus cannot compromise the security of the network. The underlying mechanism and the overall system have been validated on a dynamically changing mix of jobs, some sequential, some PVM, some MPI, and some Calypso computations. We demonstrate the feasibility and the usefulness of our approach, thus showing how to construct a middleware resource management system to enhance the utilization of distributed systems.

Ayal Itzkovitz | Arash Baratloo | Yuan-Yuan Zhao | Zvi M. Kedem

[1] Marvin Theimer,et al. Finding idle machines in a workstation-based distributed system , 1988, [1988] Proceedings. The 8th International Conference on Distributed.

[2] Nicholas Carriero,et al. Linda in context , 1989, CACM.

[3] S. Yajnik,et al. Checkpointing in CosMiC: a user-level process migration environment , 1997, Proceedings Pacific Rim International Symposium on Fault-Tolerant Systems.

[4] Partha Dasgupta,et al. CALYPSO: a novel software system for fault-tolerant parallel processing on distributed platforms , 1995, Proceedings of the Fourth IEEE International Symposium on High Performance Distributed Computing.

[5] Miron Livny,et al. Condor-a hunter of idle workstations , 1988, [1988] Proceedings. The 8th International Conference on Distributed.

[6] Jonathan Walpole,et al. MPVM: A Migration Transparent Version of PVM , 1995, Comput. Syst..

[7] Andrea C. Arpaci-Dusseau,et al. The interaction of parallel and sequential workloads on a network of workstations , 1995, SIGMETRICS '95/PERFORMANCE '95.

[8] Miron Livny,et al. Parallel Processing on Dynamic Resources with CARMI , 1995, JSSPP.

[9] James Patton Jones. Evaluation of Job Queuing/Scheduling Software: Phase I Report , 1996 .

[10] Fred Douglis,et al. Transparent process migration: Design alternatives and the sprite implementation , 1991, Softw. Pract. Exp..

[11] Amnon Barak,et al. The MOSIX Distributed Operating System: Load Balancing for UNIX , 1993 .

[12] Peter M. A. Sloot,et al. DynamicPVM - Dynamic Load Balancing on Parallel Systems , 1994, HPCN.

[13] Warren Smith,et al. A Resource Management Architecture for Metacomputing Systems , 1998, JSSPP.

[14] Miron Livny,et al. Interfacing Condor and PVM to harness the cycles of workstation clusters , 1996, Future Gener. Comput. Syst..

[15] Rakesh Agrawal,et al. Location Independent Remote Execution in NEST , 1987, IEEE Transactions on Software Engineering.

[16] Geoffrey C. Fox,et al. A Review of Commercial and Research Cluster Management Software , 1996 .

[17] José E. Moreira,et al. A Programming Environment for Dynamic Resource Allocation and Data Distribution , 1996, LCPC.

[18] Jingwen Wang,et al. Utopia: A load sharing facility for large, heterogeneous distributed computer systems , 1993, Softw. Pract. Exp..

[19] M. Rasit Eskicioglu,et al. A comprehensive bibliography of distributed shared memory , 1996, OPSR.

[20] M. Litzkow. Remote Unix: Turning Idle Workstations into Cycle Servers , 1992 .

[21] David Kaminsky. Adaptive parallelism with Piranha , 1995 .

[22] David A. Nichols,et al. Using idle workstations in a shared computing environment , 1987, SOSP '87.

[23] B. Clifford Neuman,et al. The Prospero Resource Manager: A scalable framework for processor allocation in distributed systems , 1994, Concurr. Pract. Exp..

[24] Miron Livny,et al. The Available Capacity of a Privately Owned Workstation Environment , 1991, Perform. Evaluation.

[25] Leonard Kleinrock,et al. The Benevolent Bandit Laboratory: a testbed for distributed algorithms , 1989, IEEE J. Sel. Areas Commun..