Detecting duplicate objects in XML documents

Detecting duplicate entities that describe the same real-world object (and purging them) is an important data cleansing task, necessary to improve data quality. For data stored in a flat relation, numerous solutions to this problem exist. As XML becomes increasingly popular for data representation, algorithms to detect duplicates in nested XML documents are required. In this paper, we present a domain-independent algorithm that effectively identifies duplicates in an XML document. The solution adopts a top-down traversal of the XML tree structure to identify duplicate elements on each level. Pairs of duplicate elements are detected using a thresholded similarity function and are then clustered by computing the transitive closure. To minimize the number of pairwise element comparisons, an appropriate filter function is used. The similarity measure involves string similarity for pairs of strings, which is measured using their edit distance. To increase efficiency, we avoid computing the edit distance for pairs of strings by applying three filtering methods in succession. Initial experiments show that our approach detects XML duplicates accurately and efficiently.
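The pipeline of filtering, thresholded comparison, and transitive-closure clustering can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not name its three filters, so the length filter and bag-distance filter below are common lower bounds on edit distance chosen as assumptions, and an absolute distance threshold (`max_dist`) stands in for the thresholded similarity function. The function names are hypothetical, and the top-down traversal of the XML tree is omitted; the sketch compares plain strings on a single level.

```python
from collections import Counter
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def bag_distance(a: str, b: str) -> int:
    """Cheap lower bound on edit distance computed from character multisets."""
    ca, cb = Counter(a), Counter(b)
    return max(sum((ca - cb).values()), sum((cb - ca).values()))


def within_distance(a: str, b: str, max_dist: int) -> bool:
    """True if edit_distance(a, b) <= max_dist, trying cheap filters first."""
    if abs(len(a) - len(b)) > max_dist:  # length filter (assumed)
        return False
    if bag_distance(a, b) > max_dist:    # bag-distance filter (assumed)
        return False
    return edit_distance(a, b) <= max_dist


def cluster_duplicates(values, max_dist):
    """Group values by the transitive closure of pairwise matches (union-find)."""
    parent = list(range(len(values)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(len(values)), 2):
        if within_distance(values[i], values[j], max_dist):
            parent[find(i)] = find(j)

    clusters = {}
    for i, value in enumerate(values):
        clusters.setdefault(find(i), []).append(value)
    return list(clusters.values())
```

Calling `cluster_duplicates(["abcd", "abce", "abfe"], 1)` places all three strings in one cluster even though "abcd" and "abfe" are two edits apart, because each is within one edit of "abce". This is the effect of taking the transitive closure over detected pairs. Both filters are safe in the sense that they only reject pairs whose edit distance must exceed the threshold, so they reduce work without changing the result.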
