Paper information - Live Deduplication Storage of Virtual Machine Images in an Open-Source Cloud

Deduplication is a technique that avoids storing multiple copies of data blocks with identical content, and it has been shown to effectively reduce the disk space needed to store multi-gigabyte virtual machine (VM) images. However, it remains challenging to deploy deduplication in a real system, such as a cloud platform, where VM images are regularly inserted and retrieved. We propose LiveDFS, a live deduplication file system that enables deduplicated storage of VM images in an open-source cloud deployed on low-cost commodity hardware with a limited memory footprint. LiveDFS has several distinct features, including spatial locality, prefetching of metadata, and journaling. LiveDFS is POSIX-compliant and is implemented as a Linux kernel-space file system. We deploy our LiveDFS prototype as a storage layer in a cloud platform based on OpenStack and conduct extensive experiments. We show that, compared with an ordinary file system without deduplication, LiveDFS saves at least 40% of the space needed to store VM images while achieving reasonable performance in importing and retrieving them. Our work demonstrates the feasibility of deploying LiveDFS in an open-source cloud.
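To make the block-level idea in the abstract concrete, the following Python sketch stores each distinct block once and represents every image as an ordered list of block fingerprints. It is only an illustration and not LiveDFS's implementation, which is a Linux kernel-space file system. The 4 KiB block size, the SHA-256 fingerprint, and the `DedupStore` class with its method names are all assumptions made for this example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size; the abstract does not specify one


class DedupStore:
    """Toy in-memory store that keeps one copy of each distinct block (hypothetical)."""

    def __init__(self):
        self.blocks = {}  # fingerprint -> block bytes (stored once)
        self.images = {}  # image name -> ordered list of fingerprints

    def put(self, name, data):
        """Split data into fixed-size blocks and store only unseen blocks."""
        recipe = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            # SHA-256 is an assumed fingerprint function for this sketch.
            fingerprint = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(fingerprint, block)
            recipe.append(fingerprint)
        self.images[name] = recipe

    def get(self, name):
        """Rebuild an image by concatenating its blocks in order."""
        return b"".join(self.blocks[fp] for fp in self.images[name])

    def stored_bytes(self):
        """Bytes physically kept after deduplication."""
        return sum(len(block) for block in self.blocks.values())


# Two synthetic "images" that share their first two blocks.
image_a = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE + b"C" * BLOCK_SIZE
image_b = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE + b"D" * BLOCK_SIZE

store = DedupStore()
store.put("vm-a", image_a)
store.put("vm-b", image_b)

logical = len(image_a) + len(image_b)
print(f"logical bytes: {logical}, stored bytes: {store.stored_bytes()}")
```

In this toy case the two images share two of their three blocks, so four blocks are stored instead of six. A real system must also keep the fingerprint index on disk and look it up efficiently, which is the kind of problem the abstract's spatial locality and metadata prefetching features are aimed at.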
