Paper information - Optimization for data de-duplication algorithm based on file content

Content-defined chunking (CDC) is a prevalent data de-duplication algorithm for removing redundant data segments in archival storage systems. Current research on CDC does not consider the distinct content characteristics of different file types: it determines chunk boundaries in a random way and applies a single strategy to all file types. It has been proven that such a method cannot achieve optimal performance for compound archival data. We analyze the content characteristics of different file types and propose the candidate anchor histogram (CAH) to capture them. We propose an improved strategy for determining chunk boundaries based on CAH, and we tune some key parameters of CDC to the data layout of the underlying data de-duplication file system (TriDFS), which can efficiently store variable-sized chunks on fixed-size physical blocks. These strategies are evaluated with representative archival data, and the results indicate that, on average, they increase the compression ratio by 16.3% and write throughput by 13.7%, while decreasing read throughput by only 2.5%.
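For readers unfamiliar with CDC, the following is a minimal sketch of generic content-defined chunking. It is not the CAH-based strategy proposed in the paper and does not model TriDFS. The rolling polynomial hash, the constants (WINDOW, MASK, MIN_CHUNK, MAX_CHUNK) and the function names are all assumptions chosen for illustration. A position becomes a chunk boundary when the hash of the last WINDOW bytes matches a bit pattern, so boundaries depend on content rather than on absolute offsets.

```python
import hashlib

# All constants below are illustrative assumptions, not values from the paper.
WINDOW = 16            # bytes covered by the rolling hash
MASK = (1 << 8) - 1    # anchor when the low 8 bits of the hash are zero
MIN_CHUNK = 64         # never cut a chunk shorter than this
MAX_CHUNK = 1024       # always cut once a chunk reaches this length
BASE = 257
MOD = (1 << 31) - 1
TOP = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window


def chunk_boundaries(data):
    """Return the end offsets of the chunks of `data`."""
    boundaries = []
    start = 0
    h = 0
    for i, b in enumerate(data):
        # Rolling hash over the last WINDOW bytes; it is never reset at a
        # boundary, so it depends only on local content.
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * TOP) % MOD
        h = (h * BASE + b) % MOD
        length = i + 1 - start
        if length < MIN_CHUNK:
            continue
        if (h & MASK) == 0 or length >= MAX_CHUNK:
            boundaries.append(i + 1)
            start = i + 1
    if start < len(data):
        boundaries.append(len(data))  # trailing chunk, may be short
    return boundaries


def chunks(data):
    """Split `data` into variable-sized chunks."""
    out, prev = [], 0
    for end in chunk_boundaries(data):
        out.append(data[prev:end])
        prev = end
    return out


def dedup_ratio(streams):
    """Total bytes divided by bytes kept after storing each chunk once."""
    seen = set()
    total = stored = 0
    for s in streams:
        for c in chunks(s):
            total += len(c)
            digest = hashlib.sha1(c).digest()
            if digest not in seen:
                seen.add(digest)
                stored += len(c)
    return total / stored
```

Because a boundary decision at a given offset depends only on the bytes up to that offset, two versions of a file that share a prefix produce identical chunks for that prefix, and chunking typically resynchronizes shortly after an insertion. A fixed MASK applied to every file type is the "single strategy" the abstract criticizes.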
