Hadoop Based Scalable Cluster Deduplication for Big Data

The exponential growth of data poses a tremendous challenge to storage systems in data centers. Data deduplication, which detects and eliminates redundant data in a dataset, can greatly reduce the volume of stored data and improve the utilization of storage space. This paper presents Halodedu, a scalable and reliable cluster deduplication system built on a Hadoop-based cloud computing platform. Halodedu uses MapReduce to parallelize deduplication processing and HDFS to manage data storage. A local database on each node provides fast, distributed management of the chunk fingerprint index. To maintain the availability and reliability of metadata, HBase stores the metadata of backup files. We evaluate Halodedu using virtual machine images as the input dataset. Comparative experiments demonstrate that Halodedu improves deduplication speed and system scalability.
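The idea of a chunk fingerprint index distributed across nodes can be illustrated with a minimal single-process sketch. It is not Halodedu's implementation: the fixed 4 KiB chunk size, the SHA-1 fingerprint, the modulo routing rule, and all function names are assumptions made for illustration, and plain dictionaries stand in for the per-node local databases.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size; the abstract does not specify one


def fingerprint(chunk: bytes) -> str:
    """Return a hex fingerprint of a chunk (SHA-1 is an assumption)."""
    return hashlib.sha1(chunk).hexdigest()


def node_for(fp: str, num_nodes: int) -> int:
    """Route a fingerprint to a node index (hypothetical modulo routing)."""
    return int(fp[:8], 16) % num_nodes


def deduplicate(data: bytes, indexes: list, chunk_size: int = CHUNK_SIZE):
    """Split data into chunks and store only chunks not seen before.

    `indexes` holds one dict per node, mapping fingerprint -> chunk bytes.
    Returns the file recipe (ordered fingerprints) and the count of new chunks.
    """
    recipe = []
    new_chunks = 0
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fp = fingerprint(chunk)
        index = indexes[node_for(fp, len(indexes))]
        if fp not in index:  # unseen chunk: store it once
            index[fp] = chunk
            new_chunks += 1
        recipe.append(fp)  # the recipe always records the chunk order
    return recipe, new_chunks


def restore(recipe: list, indexes: list) -> bytes:
    """Rebuild the original data from its recipe."""
    return b"".join(indexes[node_for(fp, len(indexes))][fp] for fp in recipe)
```

In this sketch the recipe plays the role of the backup-file metadata, and each dictionary plays the role of one node's local fingerprint index. Because a fingerprint always routes to the same node, a duplicate chunk is detected with a lookup on a single node rather than a search across the whole cluster.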
