Query-time entity resolution

Entity resolution aims to reconcile database references that correspond to the same real-world entities. Given the abundance of publicly available databases in which entities are not resolved, we motivate the problem of quickly processing queries that require resolved entities from such 'unclean' databases. We propose a two-stage collective resolution strategy for processing queries, and show how it can be performed on the fly by adaptively extracting and resolving the database references that are most helpful for resolving the query. We validate our approach on two large real-world publication databases, where we show the usefulness of collective resolution and, at the same time, demonstrate the need for adaptive strategies for query processing. Finally, we show how the same queries can be answered in real time using our adaptive approach while preserving the gains of collective resolution.
