Duplicate Management for Reference Data

Recent studies show that reference or fixed-content data accounts for more than half of all newly created digital data and is growing rapidly. Reference data is characterized by enormous quantities of largely similar data and very long retention periods. Its secure retention and eventual destruction are increasingly regulated by government agencies as more and more critical data is stored electronically, where it is vulnerable to unauthorized destruction and tampering. In this paper, we describe a storage system optimized for reference data. The system manages unique chunks of data to reliably and efficiently store large amounts of similar data and to allow selected data to be efficiently shredded. We discuss ways to detect duplicate data, describing a sliding blocking method that greatly outperforms other methods. We also present practical ways to organize the metadata for the unique chunks, allowing most of the metadata to be kept on disk and to be effectively prefetched when needed. Since electronic mail (email) is an important, storage-intensive instance of reference data and is currently a focus of intense regulatory attention, we use email as a sample application and analyze its storage characteristics in detail. We find that more than 30% of the blocks in an email data set are duplicates and that a duplicate block is most likely to occur within a few days of its previous occurrence. Our analysis further indicates that the effects of duplicate block elimination and of compression techniques such as block gzip appear to be relatively independent, so that they can be combined to achieve additive results.
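
To make the sliding blocking idea concrete, the following Python sketch illustrates rsync-style duplicate detection: previously stored blocks are indexed by a cheap weak checksum and a strong hash, and new data is scanned byte by byte so block boundaries can realign after insertions or deletions. The 4 KB block size, Adler-32 weak checksum, SHA-1 strong hash, and the index_blocks/find_duplicates helpers are illustrative assumptions, not the exact design evaluated in the paper.

    # Minimal sketch of sliding-block duplicate detection (assumed parameters).
    import hashlib
    import zlib

    BLOCK_SIZE = 4096  # assumed fixed block size

    def strong_hash(block: bytes) -> str:
        return hashlib.sha1(block).hexdigest()

    def index_blocks(data: bytes, store: dict, next_id: int = 0) -> int:
        # Index non-overlapping fixed-size blocks of already-stored data:
        # store maps weak checksum -> {strong hash -> block id}.
        for off in range(0, len(data) - BLOCK_SIZE + 1, BLOCK_SIZE):
            block = data[off:off + BLOCK_SIZE]
            store.setdefault(zlib.adler32(block), {})[strong_hash(block)] = next_id
            next_id += 1
        return next_id

    def find_duplicates(data: bytes, store: dict) -> list:
        # Slide a fixed-size window over the new data one byte at a time.
        # A weak-checksum hit is confirmed with the strong hash before it
        # is recorded as a duplicate of an existing block.
        matches = []
        i = 0
        while i + BLOCK_SIZE <= len(data):
            window = data[i:i + BLOCK_SIZE]
            candidates = store.get(zlib.adler32(window))
            if candidates:
                h = strong_hash(window)
                if h in candidates:
                    matches.append((i, candidates[h]))  # (offset, block id)
                    i += BLOCK_SIZE                     # skip the matched block
                    continue
            i += 1                                      # no match: slide by one byte
        return matches

    # Hypothetical usage: a 3-byte insertion shifts alignment, yet the
    # sliding scan still finds the stored 4 KB blocks at the new offsets.
    store = {}
    old = b"A" * 8192
    index_blocks(old, store)
    print(find_duplicates(b"XYZ" + old, store))

The weak checksum acts as a cheap filter at every byte offset, while the strong hash confirms matches without a byte-by-byte comparison against stored data; a production implementation would update the rolling checksum incrementally rather than recomputing it for each window, but the matching logic is the same.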