DeDu: Building a deduplication storage system over cloud computing

This paper presents a deduplication storage system over cloud computing. Our deduplication storage system consists of two major components, a front-end deduplication application and Hadoop Distributed File System. Hadoop Distributed File System is common back-end distribution file system, which is used with a Hadoop database. We use Hadoop Distributed File System to build up a mass storage system and use a Hadoop database to build up a fast indexing system. With the deduplication applications, a scalable and parallel deduplicated cloud storage system can be effectively built up. We further use VMware to generate a simulated cloud environment. The simulation results demonstrate that our deduplication cloud storage system is more efficient than traditional deduplication approaches.

[1]  Bin Zhou,et al.  Scalable Performance of the Panasas Parallel File System , 2008, FAST.

[2]  Hong Jiang,et al.  MAD2: A scalable high-throughput exact deduplication approach for network backup services , 2010, 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST).

[3]  Howard Gobioff,et al.  The Google file system , 2003, SOSP '03.

[4]  John Black,et al.  Compare-by-Hash: A Reasoned Analysis , 2006, USENIX Annual Technical Conference, General Track.

[5]  Gregory R. Ganger,et al.  Ursa minor: versatile cluster-based storage , 2005, FAST'05.

[6]  Arif Merchant,et al.  FAB: building distributed enterprise disk arrays from commodity components , 2004, ASPLOS XI.

[7]  Irfan Ahmad,et al.  Decentralized Deduplication in SAN Cluster File Systems , 2009, USENIX Annual Technical Conference.

[8]  Mark Lillibridge,et al.  Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality , 2009, FAST.

[9]  Wilson C. Hsieh,et al.  Bigtable: A Distributed Storage System for Structured Data , 2006, TOCS.

[10]  Chandramohan A. Thekkath,et al.  Petal: distributed virtual disks , 1996, ASPLOS VII.

[11]  Michal Kaczmarczyk,et al.  HYDRAstor: A Scalable Secondary Storage , 2009, FAST.

[12]  Sean Matthew Dorward,et al.  Awarded Best Paper! - Venti: A New Approach to Archival Data Storage , 2002 .

[13]  Carlos Maltzahn,et al.  RADOS: a scalable, reliable storage service for petabyte-scale storage clusters , 2007, PDSW '07.

[14]  Kai Li,et al.  Avoiding the Disk Bottleneck in the Data Domain Deduplication File System , 2008, FAST.

[15]  Carlos Maltzahn,et al.  Ceph: a scalable, high-performance distributed file system , 2006, OSDI '06.

[16]  Miguel Castro,et al.  Farsite: federated, available, and reliable storage for an incompletely trusted environment , 2002, OPSR.

[17]  Mark Lillibridge,et al.  Extreme Binning: Scalable, parallel deduplication for chunk-based file backup , 2009, 2009 IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems.

[18]  Val Henson,et al.  An Analysis of Compare-by-hash , 2003, HotOS.

[19]  Christopher Chute,et al.  The Diverse and Exploding Digital Universe , 2011 .

[20]  Randy H. Katz,et al.  Above the Clouds: A Berkeley View of Cloud Computing , 2009 .

[21]  GhemawatSanjay,et al.  The Google file system , 2003 .