Adopting ontologies for multisource identity resolution

Identity resolution aims to identify newly presented facts and link them to their previous mentions. Our main hypothesis is that variations of one and the same fact can be recognised, duplicates can be removed, and aggregating them actually increases the correctness of fact extraction. Our approach to the identity problem has been implemented as the Identity Resolution Framework (IdRF). The framework provides a general solution for identifying known and new facts in specific domains, and it can be used in different applications to process different types of entity. It uses an ontology as the knowledge representation formalism, both internally and for its results. The ontology contains not only the representation of the domain but also known entities and their properties. Apart from extracting information from textual sources, we also exploit structured information available in databases by mapping the database schema to the ontology and populating the ontology with the existing knowledge. Our main goal is not to advocate one criterion over the others but to introduce a widely applicable solution to the identity resolution problem; to that end, we present a set of customisable criteria as well as a mechanism for adding new criteria. We have carried out two series of experiments in two different business intelligence domains, company profiling and recruitment, achieving rather encouraging results.
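The idea of customisable criteria with a mechanism for adding new ones can be illustrated with a minimal Python sketch. This is not the IdRF implementation or its API: the names `Resolver`, `add_criterion`, `name_similarity`, `same_country`, the weighted-average scoring and the 0.8 threshold are all assumptions made for illustration, and entities are plain dictionaries rather than ontology instances.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Tuple

# Hypothetical representation: an entity is a flat attribute dictionary.
Entity = Dict[str, str]
# A criterion compares two entities and returns a similarity in [0, 1].
Criterion = Callable[[Entity, Entity], float]


def name_similarity(a: Entity, b: Entity) -> float:
    """Case-insensitive string similarity of the 'name' attribute."""
    return SequenceMatcher(
        None, a.get("name", "").lower(), b.get("name", "").lower()
    ).ratio()


def same_country(a: Entity, b: Entity) -> float:
    """1.0 if countries match, 0.0 if they differ, 0.5 if either is unknown."""
    ca, cb = a.get("country"), b.get("country")
    if not ca or not cb:
        return 0.5
    return 1.0 if ca.lower() == cb.lower() else 0.0


@dataclass
class Resolver:
    """Illustrative resolver: weighted criteria plus a store of known entities."""
    threshold: float = 0.8  # assumed cut-off, not taken from the paper
    criteria: List[Tuple[Criterion, float]] = field(default_factory=list)
    known: List[Entity] = field(default_factory=list)

    def add_criterion(self, criterion: Criterion, weight: float = 1.0) -> None:
        """Register a new criterion; this is the extension point."""
        self.criteria.append((criterion, weight))

    def score(self, a: Entity, b: Entity) -> float:
        """Weighted average of all registered criteria."""
        total = sum(weight for _, weight in self.criteria)
        if total == 0:
            return 0.0
        return sum(c(a, b) * weight for c, weight in self.criteria) / total

    def resolve(self, candidate: Entity) -> Tuple[Entity, bool]:
        """Return (entity, is_new). Known entities absorb new attributes."""
        if self.known:
            best = max(self.known, key=lambda e: self.score(candidate, e))
            if self.score(candidate, best) >= self.threshold:
                for key, value in candidate.items():
                    best.setdefault(key, value)  # keep existing values
                return best, False
        self.known.append(dict(candidate))
        return self.known[-1], True
```

In this sketch, a caller would register criteria with `add_criterion`, seed `known` with entities from an existing source such as a database, and then pass each extracted candidate to `resolve`. A candidate that matches a known entity is merged into it, so attributes from several sources accumulate on one record; otherwise it is stored as a new entity.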
