Using Deduplicating Storage for Efficient Disk Image Deployment

Many clouds and network testbeds use disk images to initialize local storage on their compute devices. Large facilities must manage thousands of images or more, which requires a significant amount of storage. At the same time, to provide a good user experience, they must be able to deploy those images quickly. Driven by our experience operating the Emulab site at the University of Utah, a long-lived and heavily used testbed, we have created a new service for efficiently storing and deploying disk images. This service exploits the redundant data found in similar images, using deduplication to greatly reduce the amount of physical storage required. Beyond saving space, our system is designed for highly efficient image deployment: it integrates with Frisbee, an existing highly optimized disk image deployment system, without significantly increasing the time required to distribute and install images. In this paper, we explain the design of our system and discuss the trade-offs we made to balance efficient storage against fast disk image deployment. We also propose a new chunking algorithm, called AFC, which enables fixed-size chunking for deduplicating allocated disk sectors. Experimental results show that our system reduces storage requirements by up to 3× while imposing only a negligible runtime overhead on the end-to-end disk-deployment process.
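
To make the idea of deduplicating similar images concrete, the following is a minimal sketch of generic fixed-size chunk deduplication in Python. It is not the AFC algorithm and not the system described in the paper: the `ChunkStore` class, the 4096-byte `CHUNK_SIZE`, and the choice of SHA-256 are all assumptions made for illustration, and the sketch ignores the distinction between allocated and free sectors that AFC addresses.

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical chunk size in bytes; not taken from the paper


class ChunkStore:
    """Toy content-addressed store: each unique chunk is kept only once."""

    def __init__(self):
        self.chunks = {}  # maps SHA-256 hex digest -> chunk bytes

    def put_image(self, data: bytes) -> list:
        """Split an image into fixed-size chunks and return its recipe,
        i.e. the ordered list of chunk digests needed to rebuild it."""
        recipe = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            # Store the chunk only if an identical one is not already present.
            self.chunks.setdefault(digest, chunk)
            recipe.append(digest)
        return recipe

    def get_image(self, recipe: list) -> bytes:
        """Rebuild an image by concatenating its chunks in recipe order."""
        return b"".join(self.chunks[d] for d in recipe)

    def stored_bytes(self) -> int:
        """Physical bytes held by the store after deduplication."""
        return sum(len(c) for c in self.chunks.values())


# Two synthetic "images": the second differs from the first in its last chunk.
base_image = b"".join(bytes([i]) * CHUNK_SIZE for i in range(8))
derived_image = base_image[:7 * CHUNK_SIZE] + b"\xff" * CHUNK_SIZE

store = ChunkStore()
base_recipe = store.put_image(base_image)
derived_recipe = store.put_image(derived_image)

logical_bytes = len(base_image) + len(derived_image)
print(f"logical: {logical_bytes} bytes, stored: {store.stored_bytes()} bytes")
```

In this toy case the two images share seven of their eight chunks, so the store holds nine unique chunks instead of sixteen. Fixed-size chunking only finds such matches when shared data sits at the same chunk-aligned offsets in both images.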
