DedupCloud: an optimized efficient virtual machine deduplication algorithm in cloud computing environment

Abstract The growth of IoT technology and other media is generating a large volume of duplicate data in cloud storage, which places a heavy load on it. Data deduplication is an efficient mechanism for eliminating redundant data in virtual machines (VMs); it improves storage utilization and reduces storage cost. DedupCloud uses a byte comparison technique to minimize the number of hash value computations and the comparisons that follow them. This chapter proposes a technique that calculates and compares hash values only within the same category. This reduces the space required to store the hash values during deduplication and helps make the deduplication of VM images faster. The results show an 80% improvement in the time required to deduplicate VM images, along with savings in storage and in metadata.
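The abstract does not spell out the algorithm, so the following is only a minimal sketch of one possible reading of it, not DedupCloud's actual implementation. It assumes fixed-size chunks, a hypothetical category key built from each chunk's first byte and length (a cheap byte comparison), and SHA-1 as the hash. Hashes are computed lazily and looked up only inside the chunk's own category; the names `category_of` and `deduplicate` are illustrative.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size; the abstract does not specify one


def category_of(chunk: bytes) -> tuple:
    """Hypothetical category key: first byte plus chunk length."""
    return (chunk[0], len(chunk))


def deduplicate(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Return (store, recipe, hash_count) for a byte string.

    store      -- list of unique chunks
    recipe     -- indices into store that rebuild the original data
    hash_count -- number of hash computations actually performed
    """
    store, recipe, hash_count = [], [], 0
    # category -> {"pending": unhashed chunk id or None, "by_digest": {digest: id}}
    buckets = {}

    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        key = category_of(chunk)
        bucket = buckets.get(key)

        if bucket is None:
            # First chunk in its category: it cannot be a duplicate of an
            # earlier chunk, so store it and defer hashing entirely.
            store.append(chunk)
            chunk_id = len(store) - 1
            buckets[key] = {"pending": chunk_id, "by_digest": {}}
            recipe.append(chunk_id)
            continue

        # A second chunk reached this category: hash the deferred chunk now.
        if bucket["pending"] is not None:
            pending_id = bucket["pending"]
            digest = hashlib.sha1(store[pending_id]).digest()
            bucket["by_digest"][digest] = pending_id
            bucket["pending"] = None
            hash_count += 1

        # Hash the new chunk and compare within this category only.
        digest = hashlib.sha1(chunk).digest()
        hash_count += 1
        chunk_id = bucket["by_digest"].get(digest)
        if chunk_id is None:
            store.append(chunk)
            chunk_id = len(store) - 1
            bucket["by_digest"][digest] = chunk_id
        recipe.append(chunk_id)

    return store, recipe, hash_count


def rebuild(store, recipe) -> bytes:
    """Reassemble the original data from the unique chunks and the recipe."""
    return b"".join(store[i] for i in recipe)
```

In this sketch, chunks that are alone in their category are never hashed, and each lookup touches only the small per-category index rather than a global one. A production system would likely also verify matching chunks byte by byte to guard against hash collisions.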
