Web Scale Entity Resolution using Relational Evidence

Entity resolution has been studied extensively. Proposed approaches include using machine learning to derive domain-specific lexical similarity measures and ranking entities' attributes by their discriminative power. In this paper, we study the problem in the setting of matching two web-scale taxonomies. Besides scale, we address the challenge that the taxonomies may not contain enough context (such as attributes) for entity resolution, so traditional lexical similarity measures produce many false positive matches. To tackle this new task, we explore negative evidence in the structure of the taxonomy, as well as in external data sources such as the web. To integrate positive and negative evidence, we formulate entity resolution as the problem of finding optimal multi-way cuts in a graph. We analyze the complexity of the problem and propose a Monte Carlo algorithm for finding greedy cuts. We conduct extensive experiments that demonstrate the advantage of our approach.
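To make the graph formulation concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes positive evidence is given as weighted edges between entity mentions and negative evidence as pairs that must not end up in the same cluster. The names `greedy_partition` and `monte_carlo_cut`, the randomized edge ordering, and the toy data are all illustrative assumptions.

```python
import random


def greedy_partition(nodes, positive, negative, rng):
    """One randomized greedy pass (illustrative assumption, not the paper's method).

    Merges clusters along positive edges unless the merge would place a
    negative pair in the same cluster. Returns (cut_weight, clusters).
    """
    cluster = {n: {n} for n in nodes}  # node -> set object shared by its cluster
    forbidden = {frozenset(p) for p in negative}
    edges = list(positive.items())
    # Random order biased toward heavy edges: weight scaled by a random factor.
    edges.sort(key=lambda e: e[1] * rng.random(), reverse=True)
    for (u, v), _weight in edges:
        a, b = cluster[u], cluster[v]
        if a is b:
            continue  # already in the same cluster
        if any(frozenset((x, y)) in forbidden for x in a for y in b):
            continue  # negative evidence blocks this merge
        a |= b
        for x in b:
            cluster[x] = a
    # Weight of positive edges that end up crossing cluster boundaries.
    cut = sum(w for (u, v), w in positive.items() if cluster[u] is not cluster[v])
    groups = {frozenset(s) for s in cluster.values()}
    return cut, groups


def monte_carlo_cut(nodes, positive, negative, trials=50, seed=0):
    """Repeat the randomized greedy pass and keep the cheapest cut found."""
    rng = random.Random(seed)
    runs = (greedy_partition(nodes, positive, negative, rng) for _ in range(trials))
    return min(runs, key=lambda r: r[0])


# Hypothetical toy input: lexical similarity links all three mentions,
# but negative evidence says the company and the fruit are distinct.
nodes = ["Apple Inc.", "Apple (company)", "apple (fruit)"]
positive = {
    ("Apple Inc.", "Apple (company)"): 0.9,
    ("Apple (company)", "apple (fruit)"): 0.6,
}
negative = [("Apple Inc.", "apple (fruit)")]

cut_weight, clusters = monte_carlo_cut(nodes, positive, negative)
print(cut_weight, clusters)
```

In this sketch, a single greedy pass can lock in a poor merge early, which is why the pass is repeated with different random orderings and the lowest-weight cut is kept.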
