Implementation of a deduplication cache mechanism using content-defined chunking

Many application programs in data-intensive science read and write large files. Such files consume significant memory because their data is loaded into the page cache. Since memory is a critically valuable resource in data-intensive computing, reducing the memory footprint of file data is essential. In this paper, we propose a cache deduplication mechanism based on content-defined chunking (CDC) for the Gfarm distributed file system. CDC divides a file into variable-size blocks (chunks) based on the contents of the file. The client stores the chunks in the local file system as cache files and reuses them during subsequent file accesses. Deduplicating chunks reduces the amount of data transmitted between clients and servers, and also reduces storage and memory requirements. The experimental results demonstrate that the proposed mechanism significantly improves the performance of file-read operations and that introducing parallelism reduces the overhead of file-write operations.
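
To make the idea of content-defined chunking and chunk reuse concrete, the following is a minimal Python sketch, not the implementation described in the paper. The abstract does not state which boundary-detection algorithm or fingerprint is used, so the sketch assumes a simple gear-style rolling hash for boundaries and SHA-256 as the chunk identifier. The names `cdc_chunks`, `ChunkCache`, `MIN_SIZE`, `AVG_MASK`, and `MAX_SIZE` are hypothetical, the parameter values are illustrative, and an in-memory dictionary stands in for the cache files kept in the client's local file system.

```python
import hashlib
import random

# Illustrative parameters (assumptions, not values from the paper).
MIN_SIZE = 2 * 1024          # never cut a chunk shorter than this
AVG_MASK = (1 << 13) - 1     # a boundary matches roughly once per 8 KiB
MAX_SIZE = 64 * 1024         # force a cut if no boundary is found
MASK64 = 0xFFFFFFFFFFFFFFFF

# Fixed table of pseudo-random 64-bit values, one per byte value.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]


def cdc_chunks(data: bytes):
    """Yield variable-size chunks whose boundaries depend on the content."""
    start = 0
    h = 0
    for i, b in enumerate(data):
        # Gear-style rolling hash: older bytes shift out of the low bits.
        h = ((h << 1) + GEAR[b]) & MASK64
        length = i - start + 1
        at_boundary = length >= MIN_SIZE and (h & AVG_MASK) == 0
        if at_boundary or length >= MAX_SIZE:
            yield data[start:i + 1]
            start = i + 1
            h = 0
    if start < len(data):
        yield data[start:]  # trailing chunk without a content boundary


class ChunkCache:
    """Stand-in for a client-side chunk cache keyed by content hash."""

    def __init__(self):
        self.store = {}  # SHA-256 hex digest -> chunk bytes

    def put_file(self, data: bytes):
        """Store a file; return its chunk list and the bytes newly stored."""
        recipe = []
        new_bytes = 0
        for chunk in cdc_chunks(data):
            key = hashlib.sha256(chunk).hexdigest()
            if key not in self.store:      # only unseen chunks cost space
                self.store[key] = chunk
                new_bytes += len(chunk)
            recipe.append(key)
        return recipe, new_bytes

    def get_file(self, recipe):
        """Reassemble a file from cached chunks."""
        return b"".join(self.store[key] for key in recipe)
```

In this sketch, storing the same file a second time adds no new bytes, and storing a lightly edited copy typically adds only the chunks around the edit, because boundaries are chosen from the content rather than from fixed offsets. A real client would additionally have to fetch missing chunks from the server and persist them, which is outside the scope of this illustration.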
