Paper Information - An Efficient Duplicate Detection System for XML Documents

An Efficient Duplicate Detection System for XML Documents

Duplicate detection, an important subtask of data cleaning, is the task of identifying multiple representations of the same real-world object and is necessary to improve data quality. Numerous approaches exist for both relational and XML data. As XML becomes increasingly popular for data exchange and data publishing on the Web, algorithms to detect duplicates in XML documents are required. Previous domain-independent solutions to this problem relied on standard textual similarity functions (e.g., edit distance, cosine metric) between objects. However, such approaches result in large numbers of false positives if we want to identify domain-specific abbreviations and conventions. In this paper, we present a duplicate detection process consisting of three modules: a selector, a preprocessor, and a duplicate identifier. The process takes XML documents and a candidate definition as input and produces duplicate objects as output. The aim of this research is to develop an efficient algorithm for detecting duplicates in complex XML documents and to reduce the number of false positives by using the MD5 algorithm. We illustrate the efficiency of this approach on several real-world datasets.
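The three-module process described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how MD5 is applied, so the sketch assumes each candidate is normalized to a string and candidates with equal MD5 digests are reported as duplicates. The function names (`select_candidates`, `preprocess`, `find_duplicates`), the normalization rules, and the use of an ElementTree path as the candidate definition are all hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET
from collections import defaultdict


def select_candidates(root, candidate_path):
    """Selector (hypothetical): return the elements named by the candidate definition."""
    return root.findall(candidate_path)


def preprocess(element):
    """Preprocessor (hypothetical): flatten a candidate into a normalized string.

    Assumed normalization: lowercase the text, collapse whitespace, and sort
    the (tag, text) pairs so that the order of child elements does not matter.
    """
    parts = []
    for node in element.iter():
        if node is element:
            continue  # skip the candidate's own wrapper element
        text = " ".join((node.text or "").lower().split())
        if text:
            parts.append((node.tag, text))
    return "|".join(f"{tag}={text}" for tag, text in sorted(parts))


def find_duplicates(xml_text, candidate_path):
    """Duplicate identifier (hypothetical): group candidates with equal MD5 digests."""
    root = ET.fromstring(xml_text)
    groups = defaultdict(list)
    for index, element in enumerate(select_candidates(root, candidate_path)):
        digest = hashlib.md5(preprocess(element).encode("utf-8")).hexdigest()
        groups[digest].append(index)
    # Only digests shared by two or more candidates indicate duplicates.
    return [indices for indices in groups.values() if len(indices) > 1]


doc = """<movies>
<movie><title>The Matrix</title><year>1999</year></movie>
<movie><year>1999</year><title>the  matrix </title></movie>
<movie><title>Signs</title><year>2002</year></movie>
</movies>"""
duplicate_groups = find_duplicates(doc, "./movie")
```

Under these assumptions, the first two `movie` elements fall into one group because they normalize to the same string. Exact digest matching only catches candidates that become identical after preprocessing, so how much it helps depends on the normalization chosen; the paper itself may use MD5 differently.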

Thi Thi Soe Nyunt | Thandar Lwin
