The Hector Distributed Run-Time Environment

Harnessing the computational capabilities of a network of workstations promises to off-load work from overloaded supercomputers onto largely idle resources overnight. Doing so requires several capabilities: an architecture-independent parallel programming environment, task migration, automatic resource allocation, and fault tolerance. The Hector distributed run-time environment is designed to provide these capabilities transparently to programmers. MPI programs can run under Hector on homogeneous clusters without any modification to their source code. This paper presents the design of Hector, its internal structure, and several benchmarks and tests.
