XML Duplicate Detection Using Sorted Neighborhoods

Detecting duplicates is a problem with a long tradition in many domains, such as customer relationship management and data warehousing. The problem is twofold: first, define a suitable similarity measure, and second, apply that measure efficiently to all pairs of objects. With the advent and spread of the XML data model, new similarity measures and efficient methods are needed to detect duplicate elements in nested XML data. A classical approach to duplicate detection in flat relational data is the sorted neighborhood method, which draws its efficiency from sliding a window over the sorted relation and comparing only tuples within that window. We extend the algorithm to cover not only a single relation but also nested XML elements. To compare objects, we make use of XML parent and child relationships. For efficiency, we apply the windowing technique in a bottom-up fashion, detecting duplicates at each level of the XML hierarchy. Experiments show a speedup comparable to that of the original method and demonstrate the high effectiveness of our algorithm in detecting XML duplicates.
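
The windowing idea behind the classical sorted neighborhood method can be illustrated with a minimal sketch for flat records. This is not the XML algorithm of the paper: the function names, the sorting key, the window size, the threshold, and the use of Python's `difflib.SequenceMatcher` as the similarity measure are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Illustrative string similarity in [0, 1]; a stand-in for a real measure."""
    return SequenceMatcher(None, a, b).ratio()


def sorted_neighborhood(records, key, window=3, threshold=0.8):
    """Return index pairs of likely duplicates among `records`.

    Records are sorted by `key`; each record is then compared only with
    the `window - 1` records that precede it in sorted order, instead of
    with every other record.
    """
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        # Slide the window: look back at most `window - 1` positions.
        for j in order[max(0, pos - window + 1):pos]:
            if similarity(records[i], records[j]) >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return pairs


# Hypothetical sample data; indices refer to positions in this list.
names = ["John Smith", "Mary Jones", "Jon Smith", "Peter Brown", "Mary Jonas"]
duplicates = sorted_neighborhood(names, key=lambda s: s.lower())
print(sorted(duplicates))  # [(0, 2), (1, 4)]
```

With n records and window size w, the sketch performs roughly n·(w − 1) comparisons after sorting rather than the n·(n − 1)/2 of an all-pairs comparison. The price is that duplicates whose keys sort far apart are missed, so the choice of key matters.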
