A Declarative Framework for Linking Entities

We introduce and develop a declarative framework for entity linking and, in particular, for entity resolution. As in some earlier approaches, our framework is based on a systematic use of constraints. However, the constraints we adopt are link-to-source constraints, unlike in earlier approaches where source-to-link constraints were used to dictate how to generate links. Our approach makes it possible to focus entirely on the intended properties of the outcome of entity linking, thus separating the constraints from any procedure of how to achieve that outcome. The core language consists of link-to-source constraints that specify the desired properties of a link relation in terms of source relations and built-in predicates such as similarity measures. A key feature of the link-to-source constraints is that they employ disjunction, which enables the declarative listing of all the reasons two entities should be linked. We also consider extensions of the core language that capture collective entity resolution by allowing interdependencies among the link relations. We identify a class of “good” solutions for entity-linking specifications, which we call maximum-value solutions and which capture the strength of a link by counting the reasons that justify it. We study natural algorithmic problems associated with these solutions, including the problem of enumerating the “good” solutions and the problem of finding the certain links, which are the links that appear in every “good” solution. We show that these problems are tractable for the core language but may become intractable once we allow interdependencies among the link relations. We also make some surprising connections between our declarative framework, which is deterministic, and probabilistic approaches such as ones based on Markov Logic Networks.
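As an illustration of the idea of a link-to-source constraint with disjunction, the following minimal sketch checks that every pair in a candidate link relation is justified by at least one listed reason and counts the reasons that hold. The relations `PERSON` and `AUTHOR`, the two disjuncts, and the similarity threshold are hypothetical choices made for this example and do not come from the paper; how the framework turns reason counts into the value of a solution is not reproduced here.

```python
from difflib import SequenceMatcher

# Hypothetical source relations: (id, name, email). Not taken from the paper.
PERSON = [("p1", "Jon Smith", "jsmith@example.com"),
          ("p2", "Mary Jones", "mjones@example.com")]
AUTHOR = [("a1", "John Smith", "jsmith@example.com"),
          ("a2", "Mary Jones", "mary@example.org")]


def similar(a, b, threshold=0.8):
    """Built-in similarity predicate (assumed: difflib ratio with a cutoff)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def reasons(person, author):
    """Return the names of the disjuncts that justify linking the two tuples."""
    _, p_name, p_email = person
    _, a_name, a_email = author
    found = []
    if similar(p_name, a_name):   # first disjunct: names are similar
        found.append("similar_name")
    if p_email == a_email:        # second disjunct: emails are equal
        found.append("same_email")
    return found


def satisfies(link):
    """Check the constraint: every linked pair must come from the source
    relations and must satisfy at least one disjunct."""
    persons = {p[0]: p for p in PERSON}
    authors = {a[0]: a for a in AUTHOR}
    return all(
        pid in persons and aid in authors
        and reasons(persons[pid], authors[aid])
        for pid, aid in link
    )


if __name__ == "__main__":
    candidate = {("p1", "a1"), ("p2", "a2")}
    print(satisfies(candidate))
    for pid, aid in sorted(candidate):
        p = next(t for t in PERSON if t[0] == pid)
        a = next(t for t in AUTHOR if t[0] == aid)
        print(pid, aid, reasons(p, a))
```

The constraint here only restricts which links are allowed; it says nothing about how to generate them, which is the separation of specification from procedure that the abstract describes.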
