A theoretical framework for knowledge-based entity resolution

Abstract: Entity resolution is the process of determining whether a collection of entity representations refers to the same real-world entity. In this paper we introduce a theoretical framework that supports knowledge-based entity resolution. From a logical point of view, the expressive power of the framework is equivalent to a decidable fragment of first-order logic including conjunction, disjunction and a certain form of negation. Although the framework is expressive enough to represent knowledge about entity resolution in a collective way, two questions arise: (1) how efficiently can knowledge patterns be processed, and (2) how effectively can redundancy among knowledge patterns be eliminated? To answer these questions, we first study the evaluation problem for knowledge patterns. Our results show that this problem is NP-complete w.r.t. combined complexity but in PTIME w.r.t. data complexity. This tractability in the size of the data leads us to investigate the containment problem for knowledge patterns, which turns out to be NP-complete. We further develop a notion of optimality for knowledge patterns and a mechanism for optimizing a knowledge model (i.e., a finite set of knowledge patterns). We prove that the optimality decision problem for knowledge patterns is also NP-complete.
