Inline Data Deduplication for SSD-Based Distributed Storage

Data deduplication addresses two issues with Solid State Drives (SSDs): the price per GB of storage space and the limited write endurance of flash. By eliminating duplicate data, a deduplication system improves storage efficiency and protects SSDs from unnecessary writes. CAFTL is a well-known deduplication solution for a single SSD, but simply applying CAFTL to the SSDs of a cluster does not work well. We therefore propose a system architecture for inline deduplication built on the existing protocol of the Hadoop Distributed File System (HDFS), aiming to address the performance challenges of primary storage. Two routing algorithms are presented and evaluated on selected real-world data sets. Compared to prior work, one routing algorithm (MMHR) may improve the deduplication ratio by 8% at minimal cost, while the other (FFFR) can achieve an about 30% higher deduplication ratio at the cost of increased chunk-level fragmentation. We also formulate a new research problem, assigning a chunk to more than one node for deduplication, for further study in this area.
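The core mechanism the abstract describes can be illustrated with a minimal sketch of inline, fingerprint-based chunk routing. This is not the paper's MMHR or FFFR algorithm; it only shows the baseline idea they refine: chunks are fingerprinted, a stateless hash of the fingerprint picks the target node, and a per-node fingerprint index suppresses duplicate writes to that node's SSD. The chunk size, node count, and function names are illustrative assumptions.

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunking for simplicity; real systems often use content-defined chunking
NUM_NODES = 4      # hypothetical cluster size

# Per-node fingerprint index: dedup happens only within a node,
# which is why chunk routing decides the achievable dedup ratio.
node_indexes = [set() for _ in range(NUM_NODES)]

def route(fingerprint: bytes) -> int:
    # Stateless hash-based routing: the fingerprint alone selects the node,
    # so identical chunks always land on the same node and deduplicate there.
    return fingerprint[0] % NUM_NODES

def write(data: bytes) -> tuple[int, int]:
    """Ingest a stream inline; return (chunks_written, chunks_deduplicated)."""
    written = deduped = 0
    for off in range(0, len(data), CHUNK_SIZE):
        chunk = data[off:off + CHUNK_SIZE]
        fp = hashlib.sha256(chunk).digest()
        node = route(fp)
        if fp in node_indexes[node]:
            deduped += 1  # duplicate: store only a reference, sparing an SSD write
        else:
            node_indexes[node].add(fp)
            written += 1  # unique chunk: physically written to the node's SSD
    return written, deduped
```

Routing by fingerprint keeps the scheme stateless, but it can scatter the chunks of one file across many nodes; the fragmentation tradeoff attributed to FFFR above is exactly this tension between dedup ratio and chunk locality.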

[1] Yucheng Zhang, et al. Design Tradeoffs for Data Deduplication Performance in Backup Workloads, 2015, FAST.

[2] Steven Swanson, et al. The bleak future of NAND flash memory, 2012, FAST.

[3] David Hung-Chang Du, et al. Assuring Demanded Read Performance of Data Deduplication Storage with Backup Datasets, 2012, 2012 IEEE 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems.

[4] João Paulo, et al. A Survey and Classification of Storage Deduplication Systems, 2014, ACM Comput. Surv.

[5] Hairong Kuang, et al. The Hadoop Distributed File System, 2010, 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST).

[6] Kai Li, et al. Tradeoffs in Scalable Data Routing for Deduplication Clusters, 2011, FAST.

[7] Petros Efstathopoulos. File routing middleware for cloud deduplication, 2012, CloudCP '12.

[8] Ethan L. Miller, et al. HANDS: A heuristically arranged non-backup in-line deduplication system, 2013, 2013 IEEE 29th International Conference on Data Engineering (ICDE).

[9] Jian Liu, et al. PLC-cache: Endurable SSD cache for deduplication-based primary storage, 2014, 2014 30th Symposium on Mass Storage Systems and Technologies (MSST).

[10] Jongmoo Choi, et al. Deduplication in SSDs: Model and quantitative analysis, 2012, 2012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST).

[11] Werner Vogels, et al. Dynamo: Amazon's highly available key-value store, 2007, SOSP.

[12] Tian Luo, et al. CAFTL: A Content-Aware Flash Translation Layer Enhancing the Lifespan of Flash Memory based Solid State Drives, 2011, FAST.

[13] Andrea C. Arpaci-Dusseau, et al. Analysis of HDFS under HBase: a Facebook Messages case study, 2014, FAST.

[14] Andrei Z. Broder, et al. On the resemblance and containment of documents, 1997, Proceedings. Compression and Complexity of SEQUENCES 1997.

[15] Yang Zhang, et al. Droplet: A Distributed Solution of Data Deduplication, 2012, 2012 ACM/IEEE 13th International Conference on Grid Computing.

[16] Mark Lillibridge, et al. Extreme Binning: Scalable, parallel deduplication for chunk-based file backup, 2009, 2009 IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems.

[17] Kai Li, et al. Avoiding the Disk Bottleneck in the Data Domain Deduplication File System, 2008, FAST.

[18] Hong Jiang, et al. SiLo: A Similarity-Locality based Near-Exact Deduplication Scheme with Low RAM Overhead and High Throughput, 2011, USENIX Annual Technical Conference.

[19] Timothy Bisson, et al. iDedup: latency-aware, inline data deduplication for primary storage, 2012, FAST.

[20] Sanjay Ghemawat, et al. The Google file system, 2003, SOSP.