Ontology-based Technical Text Annotation

Powerful tools could help users explore and maintain domain-specific documentation, provided that the documents have been semantically annotated. For that, the annotations must be sufficiently specialized and rich, relying on an explicit semantic model, usually an ontology, that represents the semantics of the target domain. In this paper, we learn to annotate biomedical scientific publications with respect to a Gene Regulation Ontology. We devise a two-step approach to annotate semantic events and relations. The first step is recast as a text segmentation and labeling problem and solved using machine translation tools and a CRF; the second is treated as multi-class classification. We evaluate the approach on the BioNLP-GRO benchmark, achieving an average F-measure of 61% on event detection alone and 50% on biological relation annotation. This suggests that human annotators can be supported in domain-specific semantic annotation tasks. Under different experimental settings, we also make two observations: (1) For event detection, the newly proposed machine-translation-based method performed as well as the classical, time-consuming sequence labeling approach while requiring far fewer computational resources. (2) A highly domain-specific part of the task, namely the detection of proteins and transcription factors, is best performed by domain-aware tools, which can be run separately as an initial step of the pipeline.
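To make the first step more concrete, the following minimal sketch shows one common way to recast span annotation as a text segmentation and labeling problem, namely BIO tagging. The abstract does not state which label encoding the paper uses, so the BIO scheme, the function names `spans_to_bio` and `bio_to_spans`, the toy sentence, and the label names are all assumptions made for illustration.

```python
from typing import List, Tuple

# A span is (start_token, end_token_exclusive, label). Hypothetical format.
Span = Tuple[int, int, str]


def spans_to_bio(tokens: List[str], spans: List[Span]) -> List[str]:
    """Encode non-overlapping labeled spans as one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        if not (0 <= start < end <= len(tokens)):
            raise ValueError(f"span out of range: {(start, end, label)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: {(start, end, label)}")
        tags[start] = f"B-{label}"          # first token of the segment
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return tags


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Decode a BIO tag sequence back into labeled spans."""
    spans: List[Span] = []
    start, label = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        closes = tag == "O" or tag.startswith("B-") or tag[2:] != label
        if start is not None and closes:
            spans.append((start, i, label))
            start, label = None, None
        if tag != "O" and start is None:
            start, label = i, tag[2:]
    return spans


# Toy sentence with placeholder gene names and illustrative label names.
tokens = "geneA activates transcription of geneB .".split()
gold = [(0, 1, "Protein"), (1, 2, "PositiveRegulation"),
        (2, 3, "Transcription"), (4, 5, "Protein")]
tags = spans_to_bio(tokens, gold)
```

Under this encoding, a sequence labeler such as a CRF predicts one tag per token, and `bio_to_spans` recovers the annotated segments from the predicted tags.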
