Swoosh: a generic approach to entity resolution

We consider the entity resolution (ER) problem (also known as deduplication, or merge–purge), in which records determined to represent the same real-world entity are successively located and merged. We formalize the generic ER problem, treating the functions for comparing and merging records as black-boxes, which permits expressive and extensible ER solutions. We identify four important properties that, if satisfied by the match and merge functions, enable much more efficient ER algorithms. We develop three efficient ER algorithms: G-Swoosh for the case where the four properties do not hold, and R-Swoosh and F-Swoosh that exploit the four properties. F-Swoosh in addition assumes knowledge of the “features” (e.g., attributes) used by the match function. We experimentally evaluate the algorithms using comparison shopping data from Yahoo! Shopping and hotel information data from Yahoo! Travel. We also show that R-Swoosh (and F-Swoosh) can be used even when the four match and merge properties do not hold, if an “approximate” result is acceptable.

[1]  Chen Li,et al.  Efficient record linkage in large data sets , 2003, Eighth International Conference on Database Systems for Advanced Applications, 2003. (DASFAA 2003). Proceedings..

[2]  Rajeev Motwani,et al.  Robust and efficient fuzzy match for online data cleaning , 2003, SIGMOD '03.

[3]  Hector Garcia-Molina,et al.  D-Swoosh: A Family of Algorithms for Generic, Distributed Entity Resolution , 2007, 27th International Conference on Distributed Computing Systems (ICDCS '07).

[4]  Robert E. Tarjan,et al.  Efficiency of a Good But Not Linear Set Union Algorithm , 1972, JACM.

[5]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[6]  Lifang Gu,et al.  Record Linkage: Current Practice and Future Directions , 2003 .

[7]  Gunter Saake,et al.  Extensible and similarity-based grouping for data integration , 2002, Proceedings 18th International Conference on Data Engineering.

[8]  Salvatore J. Stolfo,et al.  The merge/purge problem for large databases , 1995, SIGMOD '95.

[9]  Lise Getoor,et al.  A Latent Dirichlet Model for Unsupervised Entity Resolution , 2005, SDM.

[10]  William W. Cohen Data integration using similarity joins and a word-based information representation language , 2000, TOIS.

[11]  George V. Moustakides,et al.  A Bayesian decision model for cost optimal record matching , 2003, The VLDB Journal.

[12]  Andrew McCallum,et al.  Disambiguating Web appearances of people in a social network , 2005, WWW '05.

[13]  W. Winkler Overview of Record Linkage and Current Research Directions , 2006 .

[14]  Matthew A. Jaro,et al.  Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida , 1989 .

[15]  H B NEWCOMBE,et al.  Automatic linkage of vital records. , 1959, Science.

[16]  Amihai Motro,et al.  Fusionplex: resolution of data inconsistencies in the integration of heterogeneous information sources , 2006, Inf. Fusion.

[17]  Salvatore J. Stolfo,et al.  Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem , 1998, Data Mining and Knowledge Discovery.

[18]  Anuradha Bhamidipaty,et al.  Interactive deduplication using active learning , 2002, KDD.

[19]  Dmitri V. Kalashnikov,et al.  Exploiting Relationships for Domain-Independent Data Cleaning , 2005, SDM.

[20]  W. Winkler USING THE EM ALGORITHM FOR WEIGHT COMPUTATION IN THE FELLEGI-SUNTER MODEL OF RECORD LINKAGE , 2000 .

[21]  Andrew McCallum,et al.  Efficient clustering of high-dimensional data sets with application to reference matching , 2000, KDD '00.

[22]  Avrim Blum,et al.  Correlation Clustering , 2004, Machine Learning.

[23]  Pedro M. Domingos Multi-Relational Record Linkage , 2003 .

[24]  Charles Elkan,et al.  An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records , 1997, DMKD.

[25]  Dennis Shasha,et al.  Declarative Data Cleaning: Language, Model, and Algorithms , 2001, VLDB.

[26]  Pedro M. Domingos,et al.  Object Identification with Attribute-Mediated Dependences , 2005, PKDD.

[27]  Matthias Blume Automatic Entity Disambiguation : Benefits to NER , Relation Extraction , Link Analysis , and Inference , .

[28]  Raghav Kaushik,et al.  Efficient exact set-similarity joins , 2006, VLDB.

[29]  Luis Gravano,et al.  Approximate String Joins in a Database (Almost) for Free , 2001, VLDB.

[30]  Lise Getoor,et al.  Iterative record linkage for cleaning and integration , 2004, DMKD '04.

[31]  Hongjun Lu,et al.  Discovering and reconciling value conflicts for numerical data integration , 2001, Inf. Syst..

[32]  Craig A. Knoblock,et al.  Learning object identification rules for information integration , 2001, Inf. Syst..

[33]  Surajit Chaudhuri,et al.  Eliminating Fuzzy Duplicates in Data Warehouses , 2002, VLDB.

[34]  Ivan P. Fellegi,et al.  A Theory for Record Linkage , 1969 .

[35]  Peter Christen,et al.  A Comparison of Fast Blocking Methods for Record Linkage , 2003, KDD 2003.

[36]  Hector Garcia-Molina,et al.  Generic Entity Resolution with Data Confidences , 2006, CleanDB.

[37]  Rajeev Motwani,et al.  Robust identification of fuzzy duplicates , 2005, 21st International Conference on Data Engineering (ICDE'05).

[38]  Jayant Madhavan,et al.  Reference reconciliation in complex information spaces , 2005, SIGMOD '05.