Improving Entity Resolution with Global Constraints

Some of the greatest advances in web search have come from leveraging socio-economic properties of online user behavior. Past advances include PageRank, anchor text, hubs and authorities, and TF-IDF. In this paper, we investigate another socio-economic property that, to our knowledge, has not yet been exploited: sites that create lists of entities, such as IMDB and Netflix, have an incentive to avoid gratuitous duplicates. We leverage this property to resolve entities across different web sites, and find that we can obtain substantial improvements in resolution accuracy. This improvement in accuracy also translates into robustness, which often reduces the amount of training data that must be labeled for comparing entities across many sites. Furthermore, the technique provides robustness when resolving sites that have some duplicates, even without first removing these duplicates. We present algorithms with very strong precision and recall, and show that max weight matching, while appearing to be a natural choice, turns out to have poor performance in some situations. The presented techniques are now being used in the back-end entity resolution system at a major Internet search engine.
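
To make the one-to-one constraint and the max weight matching baseline mentioned above concrete, the following is a minimal sketch and not the paper's algorithm. The similarity scores, the function names `max_weight_matching` and `independent_matches`, and the threshold are all hypothetical, and the brute-force search is only practical for a handful of entities.

```python
from itertools import permutations


def max_weight_matching(scores):
    """Brute-force max weight one-to-one matching on a small square matrix.

    scores[i][j] is a hypothetical similarity between entity i on site A
    and entity j on site B. Returns (total_weight, [(i, j), ...]).
    """
    n = len(scores)
    best_total, best_pairs = float("-inf"), []
    # Each permutation assigns every site-A entity to a distinct site-B entity.
    for perm in permutations(range(n)):
        total = sum(scores[i][perm[i]] for i in range(n))
        if total > best_total:
            best_total = total
            best_pairs = [(i, perm[i]) for i in range(n)]
    return best_total, best_pairs


def independent_matches(scores, threshold):
    """Unconstrained baseline: accept every pair scoring at or above threshold."""
    return [
        (i, j)
        for i, row in enumerate(scores)
        for j, s in enumerate(row)
        if s >= threshold
    ]


# Illustrative scores only (assumed, not taken from the paper).
scores = [
    [0.9, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.3],
]
total, pairs = max_weight_matching(scores)
loose = independent_matches(scores, 0.6)
```

With these assumed scores, the unconstrained baseline links entity 0 on site A to two different entities on site B, which the one-to-one constraint rules out. The matching also pairs the last two entities despite their low score of 0.3, since a full matching must assign every entity.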
