WAN-optimized replication of backup datasets using stream-informed delta compression

Replicating data off site is critical for disaster recovery, but the current approach of transferring tapes is cumbersome and error-prone. Replicating across a wide area network (WAN) is a promising alternative, but fast network connections are expensive or impractical in many remote locations, so improved compression is needed to make WAN replication truly practical. We present a new technique for replicating backup datasets across a WAN that not only eliminates duplicate regions of files (deduplication) but also compresses similar regions of files with delta compression. The technique is available as a feature of EMC Data Domain systems. Our main contribution is an architecture that adds stream-informed delta compression to existing deduplication systems and eliminates the need for new, persistent indexes. Unlike techniques that rely on knowing a file's version or that use a memory cache, our approach achieves delta compression across all data replicated to a server at any time in the past. A detailed analysis of datasets and statistics from hundreds of customers using our product shows that delta compression yields an additional 2X compression beyond deduplication and local compression, enabling customers to complete replications that would otherwise not finish within their backup window.
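
The distinction between eliminating duplicate regions and delta-compressing similar regions can be illustrated with a minimal Python sketch. It is not the architecture described in this paper: the `Replica` class, the max-of-CRC32 `sketch` function, the in-memory dictionaries, and the use of zlib with a preset dictionary as a stand-in delta encoder are all assumptions made for illustration. In particular, the paper's design avoids new persistent indexes, whereas this sketch simply keeps everything in memory and lets the sender look at the replica's state directly.

```python
import hashlib
import zlib


def fingerprint(chunk: bytes) -> bytes:
    """Exact-match identity of a chunk (used for deduplication)."""
    return hashlib.sha1(chunk).digest()


def sketch(chunk: bytes, window: int = 16) -> int:
    """Hypothetical resemblance sketch: the maximum CRC32 over all
    sliding windows. Chunks that differ only slightly usually share it."""
    if len(chunk) <= window:
        return zlib.crc32(chunk)
    return max(zlib.crc32(chunk[i:i + window])
               for i in range(len(chunk) - window + 1))


class Replica:
    """Toy stand-in for the remote server's chunk store (in memory only)."""

    def __init__(self) -> None:
        self.by_fingerprint = {}  # fingerprint -> chunk bytes
        self.by_sketch = {}       # sketch -> fingerprint of a similar base

    def store(self, chunk: bytes) -> None:
        fp = fingerprint(chunk)
        self.by_fingerprint[fp] = chunk
        self.by_sketch[sketch(chunk)] = fp


def encode(chunk: bytes, replica: Replica):
    """Choose the cheapest way to send `chunk` to `replica`."""
    fp = fingerprint(chunk)
    if fp in replica.by_fingerprint:
        return ("dup", fp, b"")  # exact duplicate: send a reference only
    base_fp = replica.by_sketch.get(sketch(chunk))
    if base_fp is not None:
        # Similar chunk: compress against the base the replica already has.
        base = replica.by_fingerprint[base_fp]
        comp = zlib.compressobj(zdict=base)
        return ("delta", base_fp, comp.compress(chunk) + comp.flush())
    return ("full", fp, zlib.compress(chunk))  # local compression only


def decode(message, replica: Replica) -> bytes:
    """Reconstruct the chunk on the replica side."""
    kind, ref, payload = message
    if kind == "dup":
        return replica.by_fingerprint[ref]
    if kind == "delta":
        decomp = zlib.decompressobj(zdict=replica.by_fingerprint[ref])
        return decomp.decompress(payload) + decomp.flush()
    return zlib.decompress(payload)
```

In this toy model, a chunk already stored on the replica is sent as a fingerprint reference, a chunk whose sketch matches a stored chunk is sent as a small delta, and anything else falls back to ordinary local compression.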
