Paper information - DeDu: Building a deduplication storage system over cloud computing

DeDu: Building a deduplication storage system over cloud computing

This paper presents a deduplication storage system built on cloud computing. The system consists of two major components: a front-end deduplication application and the Hadoop Distributed File System (HDFS). HDFS is a common back-end distributed file system that is used together with a Hadoop database. We use HDFS to build a mass storage system and a Hadoop database to build a fast indexing system. With the deduplication application, a scalable and parallel deduplicated cloud storage system can be built effectively. We further use VMware to generate a simulated cloud environment. The simulation results demonstrate that our deduplication cloud storage system is more efficient than traditional deduplication approaches.
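The split described in the abstract, a bulk store for data plus a separate fast index, can be illustrated with a minimal in-memory sketch. Everything in it is an assumption for illustration: the abstract does not state the fingerprint function, the deduplication granularity, or any API, so the sketch uses SHA-256 over whole files, a plain dictionary standing in for the indexing database, and another standing in for the distributed file system. The class name `DedupStore` and its methods are hypothetical and are not taken from the paper.

```python
import hashlib


class DedupStore:
    """Toy whole-file deduplicating store (illustrative, not the paper's design)."""

    def __init__(self):
        self.index = {}  # stand-in for the indexing database: path -> fingerprint
        self.blobs = {}  # stand-in for the distributed file system: fingerprint -> bytes
        self.refs = {}   # fingerprint -> number of paths that reference it

    def put(self, path, data):
        """Store data under path. Return True only if new bytes were written."""
        fp = hashlib.sha256(data).hexdigest()  # assumed fingerprint function
        old = self.index.get(path)
        if old == fp:
            return False  # same path, same content: nothing to do
        if old is not None:
            self._release(old)  # path is being overwritten with new content
        is_new = fp not in self.blobs
        if is_new:
            self.blobs[fp] = data  # first copy of this content
        self.refs[fp] = self.refs.get(fp, 0) + 1
        self.index[path] = fp
        return is_new

    def get(self, path):
        """Look up the fingerprint in the index, then fetch the bytes."""
        return self.blobs[self.index[path]]

    def delete(self, path):
        """Remove a path; drop the bytes once no path references them."""
        self._release(self.index.pop(path))

    def _release(self, fp):
        self.refs[fp] -= 1
        if self.refs[fp] == 0:
            del self.refs[fp]
            del self.blobs[fp]


store = DedupStore()
store.put("a.txt", b"hello cloud")  # writes the bytes
store.put("b.txt", b"hello cloud")  # duplicate content: only the index grows
```

After the two `put` calls the index holds two paths while the blob store holds a single copy, which is the space saving that deduplication targets. A real deployment would replace both dictionaries with networked services and would have to handle concurrent writers, which this sketch ignores.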
