SS-dedup: A high throughput stateful data routing algorithm for cluster deduplication system

As data grows exponentially within data centers, cluster deduplication storage systems face the challenge of providing high throughput, a high deduplication ratio, and good load balance at the same time. The data routing algorithm is the key technique here: it strongly affects all three properties. In this paper, we propose SS-Dedup, a novel stateful data routing algorithm for cluster deduplication storage systems that achieves higher system throughput and good load balance at the cost of a small loss in deduplication ratio and additional memory on the client servers. SS-Dedup exploits data similarity to increase system throughput with little loss in deduplication ratio. Specifically, to reduce network traffic and response time, SS-Dedup maintains LRU caches on the client servers that store, for each data server, the fingerprints of chunks previously routed to it. Our experimental results show that SS-Dedup consumes much less network bandwidth and provides higher system throughput while maintaining good load balance and a high deduplication ratio.
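
The client-side idea described above can be illustrated with a minimal sketch. It is not the algorithm from the paper, whose details the abstract does not give. It assumes that a client groups chunks into batches, scores each data server by how many of the batch's fingerprints are already in that server's local LRU cache, and breaks ties by choosing the least-loaded server. The names FingerprintLRU, ClientRouter and route, the SHA-1 fingerprint, and the load counter are all hypothetical choices made for illustration.

```python
import hashlib
from collections import OrderedDict


class FingerprintLRU:
    """Bounded set of chunk fingerprints with least-recently-used eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def __contains__(self, fp):
        return fp in self._items

    def touch(self, fp):
        # Insert or refresh a fingerprint, evicting the oldest entry when full.
        self._items[fp] = True
        self._items.move_to_end(fp)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)


class ClientRouter:
    """Hypothetical client-side router: one fingerprint cache per data server."""

    def __init__(self, num_servers, cache_capacity):
        self.caches = [FingerprintLRU(cache_capacity) for _ in range(num_servers)]
        # Assumption: load is approximated by the number of chunks routed so far.
        self.load = [0] * num_servers

    def route(self, chunks):
        """Pick a data server for a batch of chunks (bytes) and record the choice."""
        fps = [hashlib.sha1(c).hexdigest() for c in chunks]
        # Similarity score: fingerprints already routed to each server,
        # looked up locally so that no network round trip is needed.
        hits = [sum(fp in cache for fp in fps) for cache in self.caches]
        # Most cache hits wins; among equal scores prefer the least-loaded server.
        target = max(range(len(self.caches)),
                     key=lambda s: (hits[s], -self.load[s]))
        for fp in fps:
            self.caches[target].touch(fp)
        self.load[target] += len(fps)
        return target
```

In this sketch, a batch that shares fingerprints with earlier batches is sent to the server that received them, while unrelated batches spread across the less-loaded servers. The cache capacity bounds the client memory spent per data server, which reflects the memory cost mentioned in the abstract.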
