Evaluating the usefulness of content addressable storage for high-performance data intensive applications

Content Addressable Storage (CAS) is a data representation technique that partitions a given dataset into non-intersecting units called chunks and then employs techniques to efficiently recognize chunks that occur multiple times. This allows CAS to eliminate duplicate instances of such chunks, reducing storage space compared to conventional representations of data. CAS is an attractive technique for reducing the storage and network bandwidth needs of performance-sensitive, data-intensive applications in a variety of domains. These include enterprise applications, Web-based e-commerce and entertainment services, and highly parallel scientific/engineering applications and simulations, to name a few. In this paper, we conduct an empirical evaluation of the benefits offered by CAS to a variety of real-world data-intensive applications. The savings offered by CAS depend crucially on (i) the nature of the dataset itself and (ii) the chunk size that CAS employs. We investigate the impact of both these factors on disk space savings, savings in network bandwidth, and error resilience of data. We find that a chunk size of 1 KB can provide up to 84% savings in disk space and even higher savings in network bandwidth, whilst trading off error resilience and incurring 14% in CAS-related overheads. Drawing upon lessons learned from our study, we provide insights on (i) the choice of chunk size for effective space savings and (ii) the use of selective data replication to counter the loss of error resilience caused by CAS.
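To make the mechanism concrete, the following is a minimal sketch of the CAS idea the abstract describes: a dataset is split into fixed-size chunks (1 KB here, matching the chunk size studied in the paper), each chunk is named by a cryptographic hash of its contents, and only one copy per unique hash is kept. All names (`dedup_store`, `reconstruct`) and the choice of SHA-1 are illustrative assumptions, not the paper's actual implementation; real CAS systems often use content-defined (Rabin fingerprint) chunk boundaries rather than fixed offsets.

```python
import hashlib

def dedup_store(data: bytes, chunk_size: int = 1024):
    """Illustrative CAS sketch: split into fixed-size chunks, keep one
    copy per unique chunk, and record a recipe of chunk hashes."""
    store = {}    # chunk hash -> chunk bytes (the deduplicated pool)
    recipe = []   # ordered hashes needed to reconstruct the original data
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha1(chunk).hexdigest()
        store[digest] = chunk   # duplicate chunks overwrite identical bytes
        recipe.append(digest)
    return store, recipe

def reconstruct(store, recipe):
    """Rebuild the original data from the chunk pool and the recipe."""
    return b"".join(store[h] for h in recipe)

# Highly redundant input: four identical 1 KB chunks followed by one distinct one.
data = b"A" * 4096 + b"B" * 1024
store, recipe = dedup_store(data)
print(len(recipe), len(store))  # 5 chunks referenced, only 2 stored
```

The gap between chunks referenced and chunks stored is the space saving; it also illustrates the error-resilience trade-off the abstract mentions, since losing one stored chunk corrupts every place it is referenced.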
