Duplicate detection in XML data

Duplicate detection is the task of identifying multiple representations of the same real-world object, for every object represented in a data source. It is relevant in data cleaning and data integration applications and has been studied extensively for relational data describing a single type of object in a single table. Our research focuses on iterative duplicate detection in XML data. We consider detecting duplicates in multiple types of objects related to each other and devise methods adapted to semi-structured XML data. Relationships between different types of objects form either a hierarchical structure (given by the hierarchical XML structure) or a graph structure (e.g., given by referential constraints). Iterative duplicate detection requires a similarity measure to compare pairs of object representations, called candidates, based on descriptive information of a candidate. First, because the distinction between candidates and their descriptions is not straightforward in XML, we show that these descriptions can be determined semi-automatically using heuristics and conditions. Second, we define a similarity measure that is suited for XML data and that considers relationships between candidates. It considers data comparability, data relevance, and data similarity, and distinguishes between missing and contradictory data. Experimental evaluation shows that our similarity measure outperforms existing similarity measures in terms of effectiveness. To avoid pairwise comparisons and thereby improve efficiency without compromising effectiveness, we propose three comparison strategies. The top-down algorithm is suited for hierarchically related candidates where nesting represents 1:N relationships, whereas the bottom-up algorithm is suited when nested candidates actually exist in M:N relationships in the real world. When candidate relationships form a graph, the Reconsidering Algorithm re-compares candidate pairs to improve effectiveness; using a comparison order that reduces the number of re-comparisons nevertheless allows efficient and effective duplicate detection. To scale to large amounts of data, we propose methods that interact with a database to handle retrieval, classification, and update of iteratively compared candidate pairs. Empirical evaluation shows that these methods scale linearly in time with the number of candidates, the connectivity of the graph, and the fraction of duplicates among all candidates. We are the first to obtain linear behavior for all these parameters. In summary, this thesis presents novel methods for effective, efficient, and scalable duplicate detection in graph data, and in XML data in particular. Finally, we present XClean, the first system for declarative XML data cleaning, for which we defined operators and a specification language that is compiled to XQuery.
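
To make the distinction between missing and contradictory data concrete, the following is a minimal Python sketch and not the similarity measure defined in the thesis. It assumes each candidate description is a flat dictionary of string values; the function names, the per-field weights, and the 0.8 string-similarity threshold are all hypothetical choices made for illustration. Values are compared only when both candidates provide the same field (comparability), fields can be weighted (relevance), values are matched approximately (similarity), and a field present in only one candidate is ignored rather than counted as a contradiction.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional


def values_similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Approximate, case-insensitive string match (threshold is an assumption)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def description_similarity(
    desc_a: Dict[str, str],
    desc_b: Dict[str, str],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted share of agreeing fields among fields present in both descriptions.

    A field missing from one description contributes nothing; a field present
    in both with dissimilar values counts as a contradiction.
    """
    weights = weights or {}
    agree = 0.0
    contradict = 0.0
    for field in set(desc_a) | set(desc_b):
        value_a, value_b = desc_a.get(field), desc_b.get(field)
        if value_a is None or value_b is None:
            continue  # missing data: no evidence for or against a duplicate
        weight = weights.get(field, 1.0)  # hypothetical relevance weight
        if values_similar(value_a, value_b):
            agree += weight
        else:
            contradict += weight
    compared = agree + contradict
    return agree / compared if compared else 0.0


movie = {"title": "The Matrix", "year": "1999"}
same_movie_no_year = {"title": "the matrix"}
other_year = {"title": "The Matrix", "year": "2003"}

missing_score = description_similarity(movie, same_movie_no_year)
contradicting_score = description_similarity(movie, other_year)
```

In this sketch the candidate with a missing year scores higher against `movie` than the candidate with a conflicting year, which is the behavior the abstract describes in general terms.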
