A Framework for Identity Resolution and Merging for Multi-source Information Extraction

In the context of ontology-based information extraction, identity resolution is the process of deciding whether an instance extracted from text refers to a known entity in the target domain (e.g., the ontology). We present an ontology-based framework for identity resolution that can be customized to different application domains and extraction tasks. Rules for identity resolution can be defined for each class in the ontology; they compute similarities between target and source entities based on class information and on instance properties and their values. We present a case study applying the framework to the problem of multi-source job vacancy extraction.
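As an illustration of what a per-class similarity rule might look like, the following is a minimal sketch and not the framework's actual implementation. It assumes entities are plain dictionaries with a `class` key, that a rule is a set of weighted properties, and that string values are compared with Python's `difflib`. The class name `JobVacancy`, the property names, the weights, and the 0.8 threshold are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical per-class rules: each property carries a weight (assumption).
RULES = {
    "JobVacancy": {"title": 0.5, "employer": 0.3, "location": 0.2},
}


def string_similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def similarity(source, target, rules=RULES):
    """Weighted similarity between a source (extracted) and a target (known) entity."""
    # Class information: entities of different classes never match in this sketch.
    if source["class"] != target["class"]:
        return 0.0
    weights = rules.get(source["class"], {})
    score, total = 0.0, 0.0
    for prop, weight in weights.items():
        # Only properties present on both entities contribute.
        if prop in source and prop in target:
            score += weight * string_similarity(source[prop], target[prop])
            total += weight
    return score / total if total else 0.0


def resolve(source, known_entities, threshold=0.8):
    """Return the best-matching known entity, or None if the instance looks new."""
    best, best_score = None, 0.0
    for target in known_entities:
        s = similarity(source, target)
        if s > best_score:
            best, best_score = target, s
    return best if best_score >= threshold else None


known = [
    {"class": "JobVacancy", "title": "Software Engineer",
     "employer": "Acme Ltd", "location": "London"},
]
extracted = {"class": "JobVacancy", "title": "software engineer",
             "employer": "Acme Ltd.", "location": "London"}
unrelated = {"class": "JobVacancy", "title": "Pastry Chef",
             "employer": "Bakery Co", "location": "Leeds"}

match = resolve(extracted, known)
no_match = resolve(unrelated, known)
```

Here `match` is the known vacancy, so the extracted instance would be merged with it, while `no_match` is `None`, so the instance would be treated as a new entity. A real deployment would presumably use richer, class-specific comparison functions than plain string similarity.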
