Online data deduplication for in-memory big-data analytic systems

Given a set of files that exhibit a certain degree of similarity, we consider the novel problem of performing data redundancy elimination across a set of distributed worker nodes in a shared-nothing in-memory big-data analytic system. The redundancy elimination scheme is designed to be: (i) space-efficient: the total space needed to store the files is minimized; and (ii) access-isolating: data shuffling among servers is also minimized. In this paper, we first show that finding an access-efficient and space-optimal solution is NP-hard. We then present file partitioning algorithms that locate access-efficient solutions incrementally in polynomial time. Experimental verification on multiple data sets confirms that the proposed file partitioning solution achieves a compression ratio close to the optimal compression performance of a centralized solution.
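The flavor of such an incremental, access-isolating partitioning scheme can be sketched as follows. This is a minimal illustration, not the paper's algorithm: it assumes fixed-size chunking and a simple greedy rule (place each file on the node whose stored chunks overlap it most, breaking ties toward the emptier node), so that duplicate chunks are deduplicated locally and no cross-node shuffling is needed to reconstruct a file.

```python
import hashlib

def chunk_fingerprints(data: bytes, size: int = 8) -> set:
    # Fixed-size chunking for illustration; practical deduplication
    # systems typically use content-defined chunking instead.
    return {hashlib.sha1(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)}

def assign_files(files: dict, num_nodes: int) -> dict:
    """Greedy incremental assignment: each file goes to the node whose
    stored chunk set overlaps it most, so shared chunks are stored once
    per node and reads never span nodes (access isolation)."""
    node_chunks = [set() for _ in range(num_nodes)]
    placement = {}
    for name, data in files.items():
        fp = chunk_fingerprints(data)
        # Largest overlap first; on ties, prefer the least-loaded node.
        best = max(range(num_nodes),
                   key=lambda n: (len(fp & node_chunks[n]),
                                  -len(node_chunks[n])))
        node_chunks[best] |= fp
        placement[name] = best
    return placement

# Hypothetical inputs: two similar files and one dissimilar file.
files = {
    "a.txt": b"hello world, hello world",
    "b.txt": b"hello world, hello again",
    "c.txt": b"completely different bytes",
}
print(assign_files(files, 2))
```

In this toy run, the two similar files land on the same node (their shared chunks are stored once) while the dissimilar file goes to the other node. Each greedy step costs one set intersection per node, which is where the polynomial running time comes from; the NP-hardness result in the paper concerns finding the globally space-optimal such placement.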
