Entity Resolution in Texts Using Statistical Learning and Ontologies

Ambiguities, which are inherent in natural language, make it challenging to determine the actual identities of entities mentioned in a document (e.g., Paris can refer to a city in France, but also to a small city in Texas, USA, or to the 1984 film Paris, Texas, directed by Wim Wenders). Disambiguation of this kind can be addressed by entity resolution methods. This paper studies various methods for estimating relatedness between entities, as used in collective entity resolution. We define a unified entity resolution approach capable of using implicit as well as explicit relatedness to collectively identify in-text entities. As a relatedness measure, we propose a method that expresses relatedness through the heterogeneous relations of a domain ontology. We also experiment with other relatedness measures, such as statistical learning of co-occurrences of two entities and content similarity between them. Evaluation on real data shows that the new methods for relatedness estimation give good results.
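To make the idea of collective resolution with a co-occurrence-based relatedness measure concrete, the following is a minimal sketch and not the method evaluated in the paper. It assumes a toy corpus in which each document is a set of entity identifiers, uses pointwise mutual information as a stand-in relatedness score, and selects the candidate assignment with the highest total pairwise relatedness by exhaustive search. The names DOCS, pmi, and resolve, as well as the entity identifiers, are hypothetical and chosen only for illustration.

```python
import math
from itertools import product

# Hypothetical toy corpus: each document is the set of entity IDs it mentions.
DOCS = [
    {"Paris_France", "France", "Eiffel_Tower"},
    {"Paris_France", "France"},
    {"Paris_Texas", "Texas", "Dallas"},
    {"Paris_Texas", "Texas"},
    {"France", "Eiffel_Tower"},
    {"Texas", "Dallas"},
]


def pmi(a, b, docs=DOCS):
    """Pointwise mutual information of two entities appearing in the same document."""
    n = len(docs)
    p_a = sum(a in d for d in docs) / n
    p_b = sum(b in d for d in docs) / n
    p_ab = sum(a in d and b in d for d in docs) / n
    if p_ab == 0:
        # Simplifying assumption: pairs that never co-occur count as unrelated.
        return 0.0
    return math.log(p_ab / (p_a * p_b))


def resolve(candidates, relatedness=pmi):
    """Choose one candidate per mention, maximizing total pairwise relatedness."""
    mentions = list(candidates)
    best, best_score = None, float("-inf")
    # Exhaustive search over all joint assignments; only feasible for tiny inputs.
    for combo in product(*(candidates[m] for m in mentions)):
        score = sum(
            relatedness(combo[i], combo[j])
            for i in range(len(combo))
            for j in range(i + 1, len(combo))
        )
        if score > best_score:
            best, best_score = combo, score
    return dict(zip(mentions, best))


# The mention "Paris" is ambiguous; the unambiguous mention "Texas" provides context.
result = resolve({"Paris": ["Paris_France", "Paris_Texas"], "Texas": ["Texas"]})
print(result)
```

Because resolve accepts any function of two entity identifiers, an ontology-based or content-similarity score could be passed in place of pmi without changing the search, which mirrors the unified treatment of relatedness measures described above.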
