DedupeSwift: Object-Oriented Storage System Based on Data Deduplication

Recent years have witnessed explosive growth in the volume of data. Cloud storage has been proposed as a cost-efficient and reliable way to store this data, but as data volume grows, the data centers that provide cloud storage services need ever more storage resources to keep up with demand. Data deduplication is a technique that removes redundant data blocks, and it has been used to reduce the storage footprint of backup and archival systems. In this paper, we propose DedupeSwift, which is built on OpenStack Swift, an open-source object-oriented storage software widely used in public and private clouds. DedupeSwift introduces data deduplication to reduce storage overhead. To limit the performance overhead that deduplication brings, it uses a lazy method to relieve the disk I/O bottleneck. Compression and caching are also used to improve read performance. Experimental results show that DedupeSwift reduces storage overhead by 65.24% and 89.84% on two data sets while maintaining favorable upload and download throughput.
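As an illustration of the general idea of block-level deduplication combined with compression, the following minimal sketch keeps one compressed copy of each unique chunk and records every object as an ordered list of chunk fingerprints. It is not DedupeSwift's implementation: the `ChunkStore` class, the fixed-size chunking, the SHA-256 fingerprints, and the in-memory dictionaries are all assumptions made for the example, and the lazy method and caching mentioned above are not modeled.

```python
import hashlib
import zlib


class ChunkStore:
    """Toy in-memory store (hypothetical, for illustration only).

    Each unique chunk is stored once, compressed; objects are kept as
    "recipes", i.e. ordered lists of chunk fingerprints.
    """

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size  # fixed-size chunking is an assumption
        self.chunks = {}              # fingerprint -> compressed chunk bytes
        self.objects = {}             # object name -> list of fingerprints

    def put(self, name, data):
        """Split data into chunks and store only chunks not seen before."""
        recipe = []
        for offset in range(0, len(data), self.chunk_size):
            chunk = data[offset:offset + self.chunk_size]
            fingerprint = hashlib.sha256(chunk).hexdigest()
            if fingerprint not in self.chunks:
                # New content: compress it and keep a single copy.
                self.chunks[fingerprint] = zlib.compress(chunk)
            recipe.append(fingerprint)
        self.objects[name] = recipe

    def get(self, name):
        """Rebuild an object by decompressing its chunks in recipe order."""
        return b"".join(
            zlib.decompress(self.chunks[fp]) for fp in self.objects[name]
        )

    def unique_chunks(self):
        """Number of distinct chunks physically stored."""
        return len(self.chunks)


store = ChunkStore(chunk_size=4)
store.put("a", b"AAAABBBBAAAA")   # three chunks, two of them identical
store.put("b", b"BBBBCCCC")       # shares the BBBB chunk with object "a"
restored = store.get("a")
```

In this toy run the two objects contain five chunks in total, but only three distinct chunks are stored. A real system would persist the fingerprint index on disk, which is where the disk I/O bottleneck mentioned in the abstract comes from.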
