Evil twins: two models for TCB reduction in HPC clusters

Traditional high-performance computing systems require extensive management and suffer from security and configuration problems. This paper presents two generations of a cluster-management system that aims to make clusters as secure and self-managing as possible. The goal of the system is minimality: all nodes in a cluster are configured with a minimal software base consisting of a virtual machine monitor and a remote bootstrapping mechanism, and customers then buy access using a simple pre-paid token scheme. All necessary application software, including the operating system, is provided by the customer as a full virtual machine, and is bootstrapped or migrated into the cluster. We have explored two different models for cluster control. The first, a decentralized push model ("Evil Man"), requires direct network access to cluster nodes, each of which runs a truly minimal control plane implementation consisting of only a few hundred lines of C code. In the second, a centralized pull model ("Evil Twin"), nodes may run behind NATs or firewalls and are controlled by a centralized web service. A specially developed cache invalidation protocol tells nodes when to reload their workload description from the centralized service.
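The abstract gives no details of the pull model's protocol, so the following is only an illustrative in-process sketch of the general idea: a node keeps a cached workload description and refetches it from the central service when told that its copy is stale. Every name here (`ControlService`, `Node`, `on_invalidate`, the version counter) is a hypothetical assumption and is not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class ControlService:
    """Hypothetical stand-in for the centralized web service."""
    # Maps node_id -> (version, workload description). Assumed layout.
    workloads: dict = field(default_factory=dict)

    def set_workload(self, node_id, description):
        # Bump the version so nodes can tell a new description from an old one.
        version = self.workloads.get(node_id, (0, None))[0] + 1
        self.workloads[node_id] = (version, description)
        return version

    def fetch(self, node_id):
        # In a pull model the node initiates this request itself,
        # which is what lets it sit behind a NAT or firewall.
        return self.workloads.get(node_id, (0, None))


@dataclass
class Node:
    """Hypothetical cluster node holding a cached workload description."""
    node_id: str
    service: ControlService
    version: int = 0
    description: object = None
    fetches: int = 0

    def on_invalidate(self, version):
        # The invalidation carries no workload data, only a hint that the
        # cached copy may be stale. Duplicate or old hints are ignored.
        if version > self.version:
            self.reload()

    def reload(self):
        self.fetches += 1
        self.version, self.description = self.service.fetch(self.node_id)


service = ControlService()
node = Node("node-1", service)

v1 = service.set_workload("node-1", "boot customer-vm-a")
node.on_invalidate(v1)
node.on_invalidate(v1)  # duplicate hint, no second fetch

v2 = service.set_workload("node-1", "boot customer-vm-b")
node.on_invalidate(v2)
```

In this sketch the invalidation message is deliberately content-free, so the service remains the only source of workload data and a lost or repeated hint cannot corrupt a node's state. How the real system delivers such hints to nodes behind NATs is not described in the abstract and is left out here.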
