ADMAD: Application-Driven Metadata Aware De-duplication Archival Storage System

Current storage systems hold a large amount of duplicated or redundant data. Data de-duplication, which uses lossless data compression schemes to minimize duplicated data at the inter-file level, has therefore received broad attention in recent years. Current approaches and storage systems still face open research challenges, including how to chunk files more efficiently, how to better exploit the similarity and identity among data produced by specific applications, and how to store the chunks effectively and reliably on secondary storage devices. In this paper, we propose ADMAD, an application-driven metadata aware de-duplication archival storage system. ADMAD uses metadata available at different levels of the I/O path to guide file partitioning into more meaningful data chunks (MC), with the goal of maximally reducing inter-file duplication. Because these chunks vary in size, storing them directly on storage devices can cause heavy fragmentation and a high percentage of random disk accesses, which is very inefficient. In ADMAD, chunks are therefore further packaged into fixed-size objects that serve as the storage units, which speeds up I/O and eases data management. Preliminary experiments show that, compared with current methods, the proposed system further reduces the required storage space (by 20% to nearly 50% across several datasets) and substantially improves write performance (by about 50%-70% on average).
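The two ideas in the abstract, metadata-guided chunking and packing variable-size chunks into fixed-size objects, can be illustrated with a minimal Python sketch. It is not the ADMAD implementation: the names `chunk_by_metadata`, `ObjectStore`, and `OBJECT_SIZE`, the use of SHA-1 fingerprints, the tiny 64-byte object size, and the assumption that chunk boundaries arrive as a list of byte offsets derived from application metadata are all hypothetical choices made for illustration.

```python
import hashlib

OBJECT_SIZE = 64  # hypothetical object capacity in bytes, tiny for illustration


def chunk_by_metadata(data, boundaries):
    """Split data at byte offsets assumed to come from application metadata."""
    inner = sorted({b for b in boundaries if 0 < b < len(data)})
    cuts = [0] + inner + [len(data)]
    return [data[start:end] for start, end in zip(cuts, cuts[1:]) if end > start]


class ObjectStore:
    """Stores each unique chunk once, packed into fixed-capacity objects."""

    def __init__(self, object_size=OBJECT_SIZE):
        self.object_size = object_size
        self.objects = [bytearray()]  # the last object is the one being filled
        self.index = {}               # fingerprint -> (object id, offset, length)

    def put_chunk(self, chunk):
        fingerprint = hashlib.sha1(chunk).hexdigest()
        if fingerprint in self.index:
            return fingerprint        # duplicate chunk: nothing new is written
        if len(chunk) > self.object_size:
            raise ValueError("chunk larger than one object (not handled here)")
        if len(self.objects[-1]) + len(chunk) > self.object_size:
            self.objects.append(bytearray())  # current object is full, open a new one
        object_id = len(self.objects) - 1
        offset = len(self.objects[-1])
        self.objects[-1] += chunk
        self.index[fingerprint] = (object_id, offset, len(chunk))
        return fingerprint

    def put_file(self, data, boundaries):
        """Return the file's recipe: the ordered list of chunk fingerprints."""
        return [self.put_chunk(c) for c in chunk_by_metadata(data, boundaries)]

    def get_file(self, recipe):
        out = bytearray()
        for fingerprint in recipe:
            object_id, offset, length = self.index[fingerprint]
            out += self.objects[object_id][offset:offset + length]
        return bytes(out)

    def stored_bytes(self):
        return sum(len(obj) for obj in self.objects)


# Two files with different headers but an identical body.
header_a = b"HDR:alice;"
header_b = b"HDR:bob;;;"
body = b"shared attachment payload "
file_a = header_a + body
file_b = header_b + body

store = ObjectStore()
recipe_a = store.put_file(file_a, [len(header_a)])
recipe_b = store.put_file(file_b, [len(header_b)])
```

Because the split falls exactly on the header/body boundary, the shared body is stored only once even though the two files differ in their first bytes. A real system would also need to handle chunks larger than one object, pad or seal objects to their fixed size, and persist the index, none of which this sketch attempts.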
