Deduplication in Databases using Locality Sensitive Hashing and Bloom filter

Duplicate records in databases are a major data quality problem that can lead to poor decisions. Deduplication is a capacity optimization technique that improves storage efficiency by eliminating redundant copies of data; large databases may contain tens of thousands of duplicates, which makes automatic deduplication a necessity. This paper proposes an effective duplicate detection method for the automatic deduplication of text files and repeated strings: a similarity-based data deduplication scheme that integrates Bloom filters with Locality Sensitive Hashing (LSH) and significantly reduces computation overhead by performing deduplication operations only on similar texts. The proposed system checks whether strings or texts in the repository are similar; if they are, it removes the duplicates and retains a single copy of the data. The combination of Locality Sensitive Hashing and Bloom filters yields better results than known methods, with lower complexity.

Index Terms: Deduplication, Bloom filter, Levenshtein Distance.
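A minimal sketch of the idea described above, under my own assumptions about the details (the paper does not give its exact parameters): texts are reduced to character shingles, MinHash signatures approximate Jaccard similarity, LSH banding turns each signature into a handful of band keys, and a simple Bloom filter records the band keys already seen, so a new text is dropped as a likely near-duplicate whenever any of its bands has been seen before. All names and parameter choices here (shingle size, 64 hash functions, 16 bands of 4 rows) are illustrative, not the authors'.

```python
import hashlib
import random

def shingles(text, k=3):
    """k-character shingles of a whitespace-normalized, lowercased string."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash(shingle_set, num_hashes=64, seed=42):
    """MinHash signature: for each salted hash function, the minimum hash
    value over all shingles. Signatures of similar sets agree in a fraction
    of positions close to their Jaccard similarity."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return tuple(
        min(int.from_bytes(hashlib.md5(f"{salt}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set)
        for salt in salts
    )

class BloomFilter:
    """Tiny Bloom filter over an integer bitmask (illustrative sizes)."""
    def __init__(self, size=4096, num_hashes=4):
        self.size, self.num_hashes, self.bits = size, num_hashes, 0

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))

def deduplicate(texts, bands=16, rows=4):
    """Keep one copy per group of near-duplicate texts. LSH banding: two
    texts whose MinHash signatures agree on any whole band are candidate
    duplicates; the Bloom filter remembers bands already seen, so similar
    texts are filtered without pairwise comparison of all documents."""
    seen_bands = BloomFilter()
    kept = []
    for text in texts:
        sig = minhash(shingles(text), num_hashes=bands * rows)
        band_keys = [str(sig[b * rows:(b + 1) * rows]) for b in range(bands)]
        if any(key in seen_bands for key in band_keys):
            continue  # likely a near-duplicate of an already-kept text
        for key in band_keys:
            seen_bands.add(key)
        kept.append(text)
    return kept
```

Banding trades precision for recall: with 16 bands of 4 rows, texts with high Jaccard similarity almost always share at least one band, while dissimilar texts rarely do; the Bloom filter then makes the "seen before?" check constant-time at the cost of a small false-positive rate.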
