Controlled Knowledge Base Enrichment from Web Documents

The Linked Open Data initiative has led to more and more RDF data sources being published on the Web. However, these data sources contain relatively little information compared to the documents available on the surface Web. Many annotation tools have been proposed in the last decade for the automatic construction and enrichment of knowledge bases. But while noticeable advances have been achieved in the extraction of concept instances, the extraction of semantic relations remains a challenging task when the structures and vocabularies of the target documents are heterogeneous. In this paper, we propose a novel approach, called REISA, which enriches RDF/OWL knowledge bases with semantic relations using semi-structured documents annotated with concept instances. REISA produces weighted relation instances without exploiting lexico-syntactic or structural regularities in the documents. Neighboring domain entities in the annotated documents are used to generate the first sets of candidate relations according to the domain and range axioms defined in a domain ontology. The construction of these candidate sets relies on automated semantic controls performed with (i) the existing knowledge bases and (ii) the (inverse) functionality of the target relations. The selected relation candidates are then weighted according to the neighborhood distance between the annotated domain entities in the document. Experiments on two real web datasets show that (i) REISA extracts semantic relations with precision values of up to 76.5% and that (ii) the weighting method is effective for ranking the relation candidates according to their precision.
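
The pipeline described above (candidate generation from domain and range axioms, semantic controls, distance-based weighting) can be illustrated with a minimal sketch. It is not the REISA implementation: the names `Entity`, `Relation`, and `candidate_relations`, the `max_distance` cutoff, and the `1 / distance` weight are all assumptions made for illustration, since the abstract does not give the actual formulas.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Entity:
    uri: str        # identifier of the annotated concept instance
    concept: str    # ontology concept the instance belongs to
    position: int   # index of the annotated node in document order (assumed)


@dataclass(frozen=True)
class Relation:
    name: str
    domain: str
    range: str
    functional: bool = False
    inverse_functional: bool = False


def candidate_relations(entities, relations, kb, max_distance=3):
    """Return (subject, relation, object, weight) tuples, best weight first.

    `kb` is a set of (subject, relation, object) triples already known.
    The weight 1 / distance is a hypothetical choice, not the paper's.
    """
    candidates = []
    for subj, obj in permutations(entities, 2):
        distance = abs(subj.position - obj.position)
        if distance == 0 or distance > max_distance:
            continue  # keep only neighboring entities
        for rel in relations:
            # Control 1: domain and range axioms of the ontology.
            if subj.concept != rel.domain or obj.concept != rel.range:
                continue
            # Skip triples the knowledge base already contains.
            if (subj.uri, rel.name, obj.uri) in kb:
                continue
            # Control 2: a functional relation allows one object per subject.
            if rel.functional and any(
                s == subj.uri and r == rel.name for s, r, _ in kb
            ):
                continue
            # Control 3: an inverse functional relation allows one subject per object.
            if rel.inverse_functional and any(
                o == obj.uri and r == rel.name for _, r, o in kb
            ):
                continue
            candidates.append((subj.uri, rel.name, obj.uri, 1.0 / distance))
    return sorted(candidates, key=lambda c: -c[3])
```

With an empty knowledge base, every neighboring pair that satisfies the domain and range axioms yields a candidate, and closer pairs rank higher. Once the knowledge base already holds a value for a functional relation, competing candidates for the same subject are discarded.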
