REXEC: A Decentralized, Secure Remote Execution Environment for Clusters

Bringing clusters of computers into the mainstream as general-purpose computing systems requires better facilities for transparent remote execution of parallel and sequential applications. While much research has been done in this area, most of it remains inaccessible for clusters built using contemporary hardware and operating systems. Existing implementations are outdated or not publicly available, require operating systems that modern hardware does not support, or simply do not meet the functional requirements of practical, real-world use. To address these issues, we designed REXEC, a decentralized, secure remote execution facility. It provides high availability, scalability, transparent remote execution, dynamic cluster configuration, decoupled node discovery and selection, a well-defined failure and cleanup model, parallel and distributed program support, and strong authentication and encryption. The system is implemented and is currently installed and in use on a 32-node cluster of 2-way SMPs running the Linux 2.2.5 operating system.
