Efficient archival data storage

The amount of data that is created and must be stored continues to grow. Archival storage systems must retain large volumes of data reliably over long periods of time at low cost. Archival storage requirements and the types of data stored vary widely, from highly compressed to highly redundant. The ever-increasing volume of archival data that needs to be retained for long periods has motivated the design of low-cost, high-efficiency storage systems. Economic factors, such as the rapidly decreasing cost of disk storage, memory, and processing, together with improvements in technology, such as increased magnetic storage densities, have moved research and development toward disk-based archival storage. To further lower cost, these systems eliminate redundancy using inter-file and intra-file data compression. Each system uses a compression method, but no single method compresses data consistently better than all the others. Our main contribution, presented in this dissertation, is to prove the thesis that it is possible to create a scalable archival storage system that efficiently stores diverse data by progressively applying large-scale data compression, providing better space efficiency than any single existing method. To support this, our work identifies common properties in these systems, evaluates efficient storage methods with respect to these properties, and presents a model for expected space and time behavior. In addition, we have developed a prototype storage system using the Progressive Redundancy Elimination of Similar and Identical Data In Objects (PRESIDIO) framework. Similar and identical files are detected by the PRE algorithm. Data is recorded using a virtual content-addressable storage (VCAS) mechanism that can be used to store content with hybrid inter-file compression methods.
This work is a key part of the Deep Store archival storage architecture, a large-scale storage system that stores immutable data efficiently and reliably for long periods of time over a cluster of nodes that record data to disk.
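
The idea of content-addressable storage mentioned above can be illustrated with a minimal sketch. This is not the PRESIDIO or VCAS implementation: the class name `ContentStore`, its methods, and the choice of SHA-256 and zlib are all assumptions made for illustration. It shows only the simplest case, in which identical content is detected by its hash and stored once, with ordinary intra-file compression applied to each unique blob.

```python
import hashlib
import zlib


class ContentStore:
    """Toy content-addressable store (hypothetical, for illustration only).

    Blobs are keyed by a hash of their content, so identical content
    is stored once regardless of how many times it is written.
    """

    def __init__(self):
        # address (hex digest) -> zlib-compressed bytes
        self._blobs = {}

    def put(self, data: bytes) -> str:
        # The address is derived from the content itself (SHA-256 is an assumption).
        address = hashlib.sha256(data).hexdigest()
        # Identical content yields the same address and is not stored again.
        if address not in self._blobs:
            self._blobs[address] = zlib.compress(data)
        return address

    def get(self, address: str) -> bytes:
        # Decompress and return the original bytes for a known address.
        return zlib.decompress(self._blobs[address])

    def blob_count(self) -> int:
        # Number of unique blobs actually stored.
        return len(self._blobs)
```

Writing the same bytes twice returns the same address and leaves `blob_count()` at 1. Detecting files that are merely similar, rather than identical, would need an additional mechanism that this sketch does not attempt.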

[66]  Elwyn R. Berlekamp,et al.  Algebraic coding theory , 1984, McGraw-Hill series in systems science.

[67]  M. Narasimha Murty,et al.  A computationally efficient technique for data-clustering , 1980, Pattern Recognit..

[68]  Prashant J. Shenoy,et al.  Rules of thumb in data engineering , 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).

[69]  Darren R. Hardy,et al.  Essence: A Resource Discovery System Based on Semantic File Indexing , 1993, USENIX Winter.

[70]  Udi Manber,et al.  GLIMPSE: A Tool to Search Through Entire File Systems , 1994, USENIX Winter.

[71]  Garth A. Gibson,et al.  RAID: high-performance, reliable secondary storage , 1994, CSUR.

[72]  Éric Fimbel Edit distance and chaitin-kolmogorov difference , 2002 .

[73]  Miguel Castro,et al.  Farsite: federated, available, and reliable storage for an incompletely trusted environment , 2002, OPSR.

[74]  Hector Garcia-Molina,et al.  Archival storage for digital libraries , 1998, DL '98.

[75]  Fred Douglis,et al.  USENIX Association Proceedings of the General Track : 2003 USENIX Annual , 2003 .

[76]  Magnus Karlsson,et al.  Taming aggressive replication in the Pangaea wide-area file system , 2002, OPSR.

[77]  Darrell D. E. Long,et al.  Swift: Using Distributed Disk Striping to Provide High I/O Data Rates , 1991, Comput. Syst..

[78]  Margo I. Seltzer,et al.  Structure and Performance of the Direct Access File System , 2002, USENIX ATC, General Track.

[79]  Abraham Lempel,et al.  A universal algorithm for sequential data compression , 1977, IEEE Trans. Inf. Theory.

[80]  H. Samet,et al.  Incremental Similarity Search in Multimedia Databases , 2000 .

[81]  Anne E. Trefethen,et al.  The Data Deluge: An e-Science Perspective , 2003 .

[82]  Ian H. Witten,et al.  Managing Gigabytes: Compressing and Indexing Documents and Images , 1999 .

[83]  Darrell D. E. Long,et al.  A linear time, constant space differencing algorithm , 1997, 1997 IEEE International Performance, Computing and Communications Conference.

[84]  Richard M. Karp,et al.  Efficient Randomized Pattern-Matching Algorithms , 1987, IBM J. Res. Dev..

[85]  Colin Percival Naı̈ve Differences of Executable Code , 2003 .

[86]  Craig A. N. Soules,et al.  Connections: using context to enhance file search , 2005, SOSP '05.

[87]  Chandramohan A. Thekkath,et al.  Frangipani: a scalable distributed file system , 1997, SOSP.

[88]  Daniel J. Rosenkrantz,et al.  A linear-time scheme for version reconstruction , 1994, TOPL.

[89]  Michael O. Rabin,et al.  Probabilistic Algorithms in Finite Fields , 1980, SIAM J. Comput..

[90]  Geoffrey Zweig,et al.  Syntactic Clustering of the Web , 1997, Comput. Networks.

[91]  Arkady B. Zaslavsky,et al.  Signature Extraction for Overlap Detection in Documents , 2002, ACSC.

[92]  Andrew V. Goldberg,et al.  Towards an archival Intermemory , 1998, Proceedings IEEE International Forum on Research and Technology Advances in Digital Libraries -ADL'98-.

[93]  Zhichen Xu,et al.  Towards a semantic, deep archival file system , 2003, The Ninth IEEE Workshop on Future Trends of Distributed Computing Systems, 2003. FTDCS 2003. Proceedings..

[94]  P. Sarbanes,et al.  Sarbanes-Oxley Act of 2002 , 2002 .

[95]  Aris M. Ouksel,et al.  Storage mappings for multidimensional linear dynamic hashing , 1983, PODS.

[96]  Krishna Bharat,et al.  The Term Vector Database: fast access to indexing terms for Web pages , 2000, Comput. Networks.

[97]  Walter F. Tichy,et al.  Rcs — a system for version control , 1985, Softw. Pract. Exp..

[98]  Timothy L. Harris,et al.  Storage, Mutability and Naming in Pasta , 2002, NETWORKING Workshops.

[99]  Rajeev Motwani,et al.  Incremental clustering and dynamic information retrieval , 1997, STOC '97.

[100]  Dan Klein,et al.  Evaluating strategies for similarity search on the web , 2002, WWW '02.

[101]  Peter K. Pearson,et al.  Fast hashing of variable-length text strings , 1990, CACM.

[102]  Michael O. Rabin,et al.  Efficient dispersal of information for security, load balancing, and fault tolerance , 1989, JACM.

[103]  Jeannette M. Wing,et al.  Verifiable secret redistribution for archive systems , 2002, First International IEEE Security in Storage Workshop, 2002. Proceedings..

[104]  Reagan Moore,et al.  Configuring and tuning archival storage systems , 1999, 16th IEEE Symposium on Mass Storage Systems in cooperation with the 7th NASA Goddard Conference on Mass Storage Systems and Technologies (Cat. No.99CB37098).

[105]  GhemawatSanjay,et al.  The Google file system , 2003 .

[106]  Tian Zhang,et al.  BIRCH: an efficient data clustering method for very large databases , 1996, SIGMOD '96.

[107]  David Eppstein,et al.  Fast hierarchical clustering and other applications of dynamic closest pairs , 1999, SODA '98.

[108]  Chandramohan A. Thekkath,et al.  Petal: distributed virtual disks , 1996, ASPLOS VII.

[109]  Anil K. Jain,et al.  Data clustering: a review , 1999, CSUR.

[110]  Anja Feldmann,et al.  Potential benefits of delta encoding and data compression for HTTP , 1997, SIGCOMM '97.

[111]  John T. Kohl,et al.  HighLight: Using a Log-structured File System for Tertiary Storage Management , 1993, USENIX Winter.

[112]  Kave Eshghi Intrinsic references in distributed systems , 2002, Proceedings 22nd International Conference on Distributed Computing Systems Workshops.

[113]  Ian Pratt,et al.  Proceedings of the General Track: 2004 USENIX Annual Technical Conference , 2004 .

[114]  Zhichen Xu,et al.  PeerSearch: Efficient Information Retrieval in Peer-to-Peer Networks , 2002 .

[115]  David G. Korn,et al.  Engineering a Differencing and Compression Data Format , 2002, USENIX Annual Technical Conference, General Track.

[116]  Brian D. Noble,et al.  Proceedings of the 5th Symposium on Operating Systems Design and Implementation Pastiche: Making Backup Cheap and Easy , 2022 .

[117]  Khalid Sayood Lossless Compression Handbook , 2003 .

[118]  Pierre Jouvelot,et al.  Semantic file systems , 1991, SOSP '91.

[119]  Darrell D. E. Long,et al.  Experimentally Evaluating In-Place Delta Reconstruction , 2002 .

[120]  Ronitt Rubinfeld,et al.  A sublinear algorithm for weakly approximating edit distance , 2003, STOC '03.

[121]  Keishi Tajima,et al.  Archiving scientific data , 2004, TODS.

[122]  D. J. Wheeler,et al.  A Block-sorting Lossless Data Compression Algorithm , 1994 .

[123]  Dengguo Feng,et al.  Collisions for Hash Functions MD4, MD5, HAVAL-128 and RIPEMD , 2004, IACR Cryptol. ePrint Arch..

[124]  Alan M. Frieze,et al.  Min-Wise Independent Permutations , 2000, J. Comput. Syst. Sci..

[125]  Scott A. Brandt,et al.  Reliability mechanisms for very large storage systems , 2003, 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies, 2003. (MSST 2003). Proceedings..

[126]  Udi Manber,et al.  Finding Similar Files in a Large File System , 1994, USENIX Winter.

[127]  Jeffrey C. Mogul,et al.  The VCDIFF Generic Differencing and Compression Data Format , 2002, RFC.

[128]  A. Broder Some applications of Rabin’s fingerprinting method , 1993 .

[129]  Dan Suciu,et al.  XMill: an efficient compressor for XML data , 2000, SIGMOD '00.

[130]  Richard N. Tucker THE DOMESDAY PROJECT , 1989 .

[131]  Monika Henzinger,et al.  Finding Related Pages in the World Wide Web , 1999, Comput. Networks.

[132]  Sean Quinlan,et al.  Venti: A New Approach to Archival Storage , 2002, FAST.

[133]  Michael A. Olson,et al.  The Design and Implementation of the Inversion File System , 1993, USENIX Winter.

[134]  Ethan L. Miller,et al.  Using content-derived names for configuration management , 1997, SSR '97.

[135]  Hector Garcia-Molina,et al.  Finding near-replicas of documents on the Web , 1999 .