SORT: A Similarity-Ownership Based Routing Scheme to Improve Data Read Performance for Deduplication Clusters

Existing data routing schemes for deduplication clusters have not addressed data read performance, even though it is well known that reads in deduplication systems incur non-trivial random disk seeks that significantly degrade read performance. In this paper, we propose SORT, a Similarity-Ownership based Routing scheme that exploits both data similarity and data ownership to improve data read performance for deduplication clusters. Experiments driven by real-world datasets show that SORT reduces random disk seeks by about 10% at the cost of only 0.11% of deduplication efficiency, achieving the best trade-off between deduplication efficiency and data read performance among the existing routing schemes compared. This result indicates that incorporating data ownership into data routing schemes can help optimize data read performance for deduplication clusters.
