Recent improvements in workstation and interconnection network performance have popularized clusters of off-the-shelf workstations. However, the potential of these clusters has yet to be fully exploited, mostly because current distributed operating systems manage cluster resources inadequately. To address this problem and approach the full computational power of large clusters of workstations, in this paper we propose Nomad, an efficient operating system for clusters of uniprocessors and/or multiprocessors. Nomad includes several characteristics that are important for modern cluster-oriented operating systems: scalability, efficient resource management across the cluster, efficient scheduling of parallel and distributed applications, distributed I/O, fault detection and recovery, protection, and backward compatibility. Some of the mechanisms used by Nomad, such as process checkpointing and migration, can be found in previously proposed systems. However, our system stands out for its strategy for disseminating information across the cluster and its efficient management of all cluster resources. In addition, Nomad is highly scalable because it uses neither centralized control nor extra messages to implement its functionality; instead, it takes advantage of the I/O traffic associated with its distributed file system. Our preliminary evaluation of the load balancing aspect of Nomad shows that the pattern of file accesses in our distributed file system allows for efficient and scalable load balancing. Our main conclusion is that the complete implementation of Nomad will most likely be efficient and will be a good platform for future research on operating systems for clusters of workstations.