Discovering and Leveraging Content Similarity to Optimize Collective on-Demand Data Access to IaaS Cloud Storage

A critical feature of IaaS cloud computing is the ability to quickly disseminate the content of a shared dataset at large scale. In this context, a common pattern is collective on-demand read, i.e., accessing the same VM image or dataset from a large number of VM instances concurrently. Various techniques avoid I/O contention on the storage service where the dataset is located without relying on pre-broadcast. Most such techniques employ peer-to-peer collaborative behavior in which the VM instances exchange information about the content that was accessed during runtime, such that it is possible to fetch the missing data pieces directly from each other rather than from the storage system. However, such techniques are often limited to the group that performs a collective read. In light of high data redundancy in large IaaS data centers and multiple users that simultaneously run VM instance groups that perform collective reads, an important opportunity arises: enabling unrelated VM instances belonging to different groups to collaborate and exchange common data in order to further reduce the I/O pressure on the storage system. This paper deals with the challenges posed by such a solution, which prompt the need for novel techniques to efficiently detect and leverage common data pieces across groups. To this end, we introduce a low-overhead fingerprint-based approach that we evaluate and demonstrate to be efficient in practice for a representative scenario on dozens of nodes and a variety of group configurations.
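As a rough illustration of how fingerprints can expose common data pieces between two unrelated datasets, the sketch below hashes fixed-size chunks and intersects the resulting fingerprint sets. It is not the method proposed in the paper: the chunk size, the choice of SHA-256, and the names `fingerprints` and `common_chunks` are assumptions made only for this example.

```python
import hashlib

# Hypothetical chunk size, kept tiny so the example is easy to follow.
CHUNK_SIZE = 4


def fingerprints(data: bytes, chunk_size: int = CHUNK_SIZE) -> dict[str, int]:
    """Map the SHA-256 fingerprint of each fixed-size chunk to its first offset."""
    index: dict[str, int] = {}
    for offset in range(0, len(data), chunk_size):
        digest = hashlib.sha256(data[offset:offset + chunk_size]).hexdigest()
        # Keep only the first offset at which a given chunk content appears.
        index.setdefault(digest, offset)
    return index


def common_chunks(a: bytes, b: bytes, chunk_size: int = CHUNK_SIZE) -> list[tuple[int, int]]:
    """Return (offset_in_a, offset_in_b) pairs for chunks with equal fingerprints."""
    fa = fingerprints(a, chunk_size)
    fb = fingerprints(b, chunk_size)
    return sorted((fa[d], fb[d]) for d in fa.keys() & fb.keys())


# Two stand-ins for images read by unrelated VM groups (assumed data).
image_a = b"AAAABBBBCCCCDDDD"
image_b = b"BBBBXXXXDDDD"
shared = common_chunks(image_a, image_b)
print(shared)
```

In this toy setting, a VM holding `image_a` could serve the chunks at its offsets 4 and 12 to a VM in another group that needs offsets 0 and 8 of `image_b`, without either of them contacting the storage system for that content.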
