Space and Time Scalability of Duplicate Detection in Graph Data

Duplicate detection consists in determining different representations of real-world objects in a database. Recent research has considered the use of relationships among object representations to improve duplicate detection. In the general case where relationships form a graph, research has mainly focused on duplicate detection quality/effectiveness. Scalability has been neglected so far, even though it is crucial for large real-world duplicate detection tasks. In this paper we scale up duplicate detection in graph data (DDG) to large amounts of data and pairwise comparisons, using the support of a relational database system. To this end, we first generalize the process of DDG. We then present how to scale algorithms for DDG in space (amount of data processed with limited main memory) and in time. Finally, we explore how complex similarity computation can be performed efficiently. Experiments on data an order of magnitude larger than data considered so far in DDG clearly show that our methods scale to large amounts of data not residing in main memory.
