An Efficient Duplicate Detection System for XML Documents

Duplicate detection, an important subtask of data cleaning, is the task of identifying multiple representations of the same real-world object and is necessary for improving data quality. Numerous approaches exist for both relational and XML data. As XML becomes increasingly popular for data exchange and data publishing on the Web, algorithms that detect duplicates in XML documents are required. Previous domain-independent solutions to this problem relied on standard textual similarity functions (e.g., edit distance, cosine metric) between objects. However, such approaches produce large numbers of false positives when domain-specific abbreviations and conventions must be identified. In this paper, we present a duplicate detection process comprising three modules: a selector, a preprocessor, and a duplicate identifier. The process takes XML documents and a candidate definition as input and produces duplicate objects as output. The aim of this research is to develop an efficient algorithm for detecting duplicates in complex XML documents and to reduce the number of false positives by using the MD5 algorithm. We illustrate the efficiency of this approach on several real-world datasets.
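
The three-module process can be illustrated with a minimal sketch. It is not the paper's implementation: the function names (select_candidates, preprocess, find_duplicates), the use of an ElementTree path as the candidate definition, the normalization rules, and the sample data are all assumptions made for illustration. The sketch assumes that the duplicate identifier compares MD5 digests of normalized candidate objects, so it only groups objects that become identical after normalization.

```python
import hashlib
import xml.etree.ElementTree as ET
from collections import defaultdict


def select_candidates(xml_text, candidate_path):
    """Selector (hypothetical): return elements matched by the candidate definition."""
    root = ET.fromstring(xml_text)
    return root.findall(candidate_path)


def preprocess(element):
    """Preprocessor (hypothetical): flatten an element into a normalized string.

    Text is lowercased, whitespace is collapsed, and tag=value pairs are
    sorted so that child order does not affect the result.
    """
    parts = []
    for node in element.iter():
        text = " ".join((node.text or "").split()).lower()
        if text:
            parts.append(f"{node.tag}={text}")
    return "|".join(sorted(parts))


def find_duplicates(xml_text, candidate_path):
    """Duplicate identifier (hypothetical): group candidates with equal MD5 digests."""
    groups = defaultdict(list)
    candidates = select_candidates(xml_text, candidate_path)
    for index, element in enumerate(candidates):
        digest = hashlib.md5(preprocess(element).encode("utf-8")).hexdigest()
        groups[digest].append(index)
    # Keep only digests shared by more than one candidate object.
    return [indices for indices in groups.values() if len(indices) > 1]


# Made-up sample document: the first two movies differ only in
# child order, letter case, and whitespace.
SAMPLE = """
<movies>
  <movie><title>The Matrix</title><year>1999</year></movie>
  <movie><year>1999</year><title>  the  matrix </title></movie>
  <movie><title>Signs</title><year>2002</year></movie>
</movies>
"""

duplicates = find_duplicates(SAMPLE, "./movie")
print(duplicates)  # [[0, 1]]
```

Grouping by digest avoids pairwise comparison of all candidates, but two objects that differ by even one character after normalization receive different digests, so the quality of the preprocessing step determines which duplicates are found.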