dedupv1: Improving deduplication throughput using solid state drives (SSD)

Data deduplication systems discover and remove redundancies between data blocks. The search for redundant data blocks is usually based on hashing the content of a block and comparing the resulting hash value with entries already stored in an index. If this index does not fit into main memory, the limited random I/O performance of hard disks constrains the overall throughput of such systems. This paper presents the architecture of the dedupv1 deduplication system, which uses solid-state drives (SSDs) to improve its throughput compared to disk-based systems. dedupv1 is designed to exploit the sweet spots of SSD technology (random reads and sequential operations) while avoiding random writes in the data path. This is achieved with a hybrid deduplication design: dedupv1 is an inline deduplication system, as it performs chunking and fingerprinting online and stores only new data, but it is able to delay much of the remaining processing as well as the I/O operations.
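The following is a minimal sketch (in Python, with hypothetical names, not the actual dedupv1 C++ implementation) of the fast path the abstract describes: each chunk is fingerprinted, the fingerprint is looked up in the chunk index (a random read, which SSDs serve well), new data is appended sequentially, and index updates are buffered so that no random writes occur in the data path.

```python
# Hypothetical illustration of the inline deduplication fast path with delayed writes.
import hashlib


class Container:
    """Append-only container; new chunk data is written strictly sequentially."""

    def __init__(self):
        self.blob = bytearray()

    def append(self, data: bytes) -> int:
        offset = len(self.blob)
        self.blob.extend(data)
        return offset


class ChunkIndex:
    """Stand-in for the SSD-resident fingerprint-to-location index."""

    def __init__(self):
        self.index = {}            # lookup = random read against the index
        self.pending_updates = []  # index updates delayed, persisted later in batches

    def store_chunk(self, data: bytes, container: Container) -> bytes:
        fp = hashlib.sha1(data).digest()   # content fingerprint of the chunk
        if fp in self.index:
            return fp                      # duplicate: only a reference is kept
        offset = container.append(data)    # new data: sequential append
        self.index[fp] = offset
        self.pending_updates.append((fp, offset))  # written back outside the data path
        return fp


if __name__ == "__main__":
    idx, cont = ChunkIndex(), Container()
    idx.store_chunk(b"hello world", cont)
    idx.store_chunk(b"hello world", cont)  # recognized as duplicate, nothing appended
    print(len(cont.blob), len(idx.pending_updates))  # 11 bytes stored once, 1 pending update
```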
