Information Extraction as an Ontology Population Task and Its Application to Genic Interactions

Ontologies are a well-motivated formal representation for modeling the knowledge needed to extract and encode data from text. Yet their tight integration with Information Extraction (IE) systems is still a research issue, a fortiori for complex ontologies that go beyond hierarchies. In this paper, we introduce an original architecture in which IE is specified by designing an ontology, and the extraction process is seen as an Ontology Population (OP) task. Concepts and relations of the ontology define a normalized text representation. Because their abstraction level is not suited to extraction directly from text, we introduce a Lexical Layer (LL) alongside the ontology, i.e. relations and classes at an intermediate level of normalization between raw text and concepts. In contrast to previous IE systems, the extraction process only involves normalizing the outputs of Natural Language Processing (NLP) modules into instances of the ontology and the LL. All the remaining reasoning is left to a query module, which uses the inference rules of the ontology to derive new instances by deduction. In this context, these inference rules subsume classical extraction rules or patterns by providing access to the appropriate abstraction level and to domain knowledge. To obtain those rules, we adopt an Ontology Learning (OL) perspective and acquire them automatically with relational Machine Learning (ML). Our approach is validated on a genic interaction extraction task over a corpus of texts on the bacterium Bacillus subtilis. We reach a global recall of 89.3% and a precision of 89.6%, with high scores for the ten conceptual relations in the ontology.
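
The division of labor described above (NLP output normalized into Lexical Layer instances, then inference rules deriving conceptual instances at query time) can be illustrated with a minimal sketch. It is not the system described in the paper: the predicate names, the single hand-written rule, and the example sentence are all assumptions made for illustration, and in the paper the rules are learned with relational ML rather than written by hand.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    """An instance of a Lexical Layer class/relation or of an ontology relation."""
    predicate: str
    args: tuple


# Hypothetical Lexical Layer instances, as a normalization step might emit
# them for an illustrative sentence such as "SigK activates gerE".
lexical_layer = {
    Fact("protein_term", ("SigK",)),
    Fact("gene_term", ("gerE",)),
    Fact("interaction_verb", ("activates",)),
    Fact("subject", ("activates", "SigK")),
    Fact("object", ("activates", "gerE")),
}


def rule_genic_interaction(facts):
    """Hypothetical inference rule: an interaction verb with a protein
    subject and a gene object yields a genic_interaction(agent, target)."""
    verbs = {f.args[0] for f in facts if f.predicate == "interaction_verb"}
    proteins = {f.args[0] for f in facts if f.predicate == "protein_term"}
    genes = {f.args[0] for f in facts if f.predicate == "gene_term"}
    derived = set()
    for s in facts:
        if s.predicate != "subject" or s.args[0] not in verbs:
            continue
        verb, agent = s.args
        if agent not in proteins:
            continue
        for o in facts:
            if o.predicate == "object" and o.args[0] == verb and o.args[1] in genes:
                derived.add(Fact("genic_interaction", (agent, o.args[1])))
    return derived


def query(facts, rules):
    """Apply the rules by forward chaining until no new instance is derived."""
    known = set(facts)
    while True:
        new = set().union(*(rule(known) for rule in rules)) - known
        if not new:
            return known
        known |= new


populated = query(lexical_layer, [rule_genic_interaction])
interactions = sorted(f.args for f in populated if f.predicate == "genic_interaction")
print(interactions)
```

In this sketch the extraction step itself contains no domain reasoning: it only produces the facts in `lexical_layer`, and the conceptual instance is obtained by deduction in `query`.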