The Design and Operation of CloudLab

Given the highly empirical nature of research in cloud computing, networked systems, and related fields, testbeds play an important role in the research ecosystem. In this paper, we describe one such facility, CloudLab, which supports systems research by providing raw access to programmable hardware, enabling research at large scales, and creating a shared platform for repeatable research. We present our experiences designing CloudLab and operating it for four years, serving nearly 4,000 users who have run over 79,000 experiments on 2,250 servers, switches, and other pieces of datacenter equipment. From this experience, we draw lessons organized around two themes. The first set comes from analysis of data regarding the use of CloudLab: how users interact with it, what they use it for, and the implications for facility design and operation. The second set comes from looking at the ways that algorithms used "under the hood," such as resource allocation, have important, and sometimes unexpected, effects on user experience and behavior. These lessons can be of value to the designers and operators of IaaS facilities in general, systems testbeds in particular, and users who have a stake in understanding how these systems are built.
