Dynamic Adaptation for Elastic System Services Using Virtual Servers

The vast majority of legacy runtime systems and middleware used in cluster and supercomputing environments are static in nature. As high-performance computing systems grow in scale and complexity, this static design is likely to impede the scalability and resilience of systems software. Server mobility is further limited because services are traditionally bound statically to specific communication endpoints. To address these challenges, which are imminent for exascale-class systems, distributed middleware needs to support dynamic reconfiguration, redundant and replicated state, and adaptation in which the number of servers varies with the load on the system. We identify the key features required of the underlying network infrastructure to support dynamic adaptation and elasticity in distributed system software, and describe a high-performance middleware library that implements the proposed interface. We discuss several novel approaches to dynamic resolution using range computations performed by hosts (in software) and by switches (in hardware), and compare their performance on contemporary Ethernet networks. Finally, we validate the benefits of our library with two applications: a scalable DHCP server and an elastic key-value store.
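To make the idea of host-side (software) range-based resolution concrete, the following is a minimal sketch, not the library's actual interface. It assumes a fixed space of virtual server IDs split into contiguous ranges, one per physical endpoint. The class name `RangeResolver`, its methods, the hash-based key mapping, and the example addresses are all hypothetical and chosen only for illustration.

```python
import bisect
import hashlib


class RangeResolver:
    """Hypothetical host-side resolver: virtual server ID -> physical endpoint.

    Assumption: the virtual ID space [0, num_virtual) is divided into
    contiguous, near-equal ranges, one range per physical endpoint.
    """

    def __init__(self, num_virtual, endpoints):
        self.num_virtual = num_virtual
        self.rebalance(endpoints)

    def rebalance(self, endpoints):
        """Recompute range boundaries when servers are added or removed."""
        if not endpoints:
            raise ValueError("at least one endpoint is required")
        n = len(endpoints)
        self.endpoints = list(endpoints)
        # starts[i] is the first virtual ID owned by endpoints[i].
        self.starts = [i * self.num_virtual // n for i in range(n)]

    def virtual_id(self, key):
        """Map an application key (bytes) to a stable virtual server ID."""
        digest = hashlib.sha256(key).digest()
        return int.from_bytes(digest[:8], "big") % self.num_virtual

    def resolve(self, virtual_id):
        """Find the endpoint whose range contains virtual_id."""
        if not 0 <= virtual_id < self.num_virtual:
            raise ValueError("virtual ID out of range")
        idx = bisect.bisect_right(self.starts, virtual_id) - 1
        return self.endpoints[idx]


if __name__ == "__main__":
    resolver = RangeResolver(1024, ["10.0.0.1:7000", "10.0.0.2:7000"])
    vid = resolver.virtual_id(b"node-42/lease")
    print(vid, resolver.resolve(vid))
    # Scale out: clients keep addressing the same virtual ID,
    # only the range table changes.
    resolver.rebalance(["10.0.0.1:7000", "10.0.0.2:7000",
                        "10.0.0.3:7000", "10.0.0.4:7000"])
    print(vid, resolver.resolve(vid))
```

In this sketch clients address stable virtual IDs, so growing or shrinking the server set only changes the small range table rather than every client's binding. A switch-side variant would express the same ranges as match rules in hardware; that is not shown here.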
