EXTRACTING CONTEXTUALIZED COMPLEX BIOLOGICAL EVENTS WITH RICH GRAPH‐BASED FEATURE SETS

We describe a system for extracting complex events among genes and proteins from biomedical literature, developed in context of the BioNLP’09 Shared Task on Event Extraction. For each event, the system extracts its text trigger, class, and arguments. In contrast to the approaches prevailing prior to the shared task, events can be arguments of other events, resulting in a nested structure that better captures the underlying biological statements. We divide the task into independent steps which we approach as machine learning problems. We define a wide array of features and in particular make extensive use of dependency parse graphs. A rule‐based postprocessing step is used to refine the output in accordance with the restrictions of the extraction task. In the shared task evaluation, the system achieved an F‐score of 51.95% on the primary task, the best performance among the participants. Currently, with modifications and improvements described in this article, the system achieves 52.86% F‐score on Task 1, the primary task, improving on its original performance. In addition, we extend the system also to Tasks 2 and 3, gaining F‐scores of 51.28% and 50.18%, respectively. The system thus addresses the BioNLP’09 Shared Task in its entirety and achieves the best performance on all three subtasks.

[1]  Sampo Pyysalo,et al.  Overview of BioNLP’09 Shared Task on Event Extraction , 2009, BioNLP@HLT-NAACL.

[2]  Eugene Charniak,et al.  Coarse-to-Fine n-Best Parsing and MaxEnt Discriminative Reranking , 2005, ACL.

[3]  Yvan Saeys,et al.  Extracting protein-protein interactions from text using rich feature vectors and feature selection , 2008, SMBM 2008.

[4]  Jari Björne,et al.  BioInfer: a corpus for information extraction in the biomedical domain , 2007, BMC Bioinformatics.

[5]  Thomas Hofmann,et al.  Support vector machine learning for interdependent and structured output spaces , 2004, ICML.

[6]  Martin F. Porter,et al.  An algorithm for suffix stripping , 1997, Program.

[7]  Eugene Charniak,et al.  Any Domain Parsing: Automatic Domain Adaptation for Natural Language Parsing , 2010 .

[8]  Jari Björne,et al.  Learning to Extract Biological Event and Relation Graphs , 2009, NODALIDA.

[9]  Christopher D. Manning,et al.  The Stanford Typed Dependencies Representation , 2008, CF+CDPE@COLING.

[10]  Yvan Saeys,et al.  Analyzing text in search of bio-molecular events: a high-precision machine learning framework , 2009, BioNLP@HLT-NAACL.

[11]  Jun'ichi Tsujii,et al.  Combining Multiple Layers of Syntactic Information for Protein-Protein Interaction Extraction , 2008 .

[12]  Jun'ichi Tsujii,et al.  A Markov Logic Approach to Bio-Molecular Event Extraction , 2009, BioNLP@HLT-NAACL.

[13]  Jun'ichi Tsujii,et al.  Corpus annotation for mining biomedical events from literature , 2008, BMC Bioinformatics.

[14]  Jihoon Yang,et al.  Data and text mining Kernel approaches for genic interaction extraction , 2008 .

[15]  Halil Kilicoglu,et al.  Syntactic Dependency Based Heuristics for Biological Event Extraction , 2009, BioNLP@HLT-NAACL.

[16]  Tapio Salakoski,et al.  Complex-to-Pairwise Mapping of Biological Relationships using a Semantic Network Representation , 2008 .

[17]  Koby Crammer,et al.  On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines , 2002, J. Mach. Learn. Res..

[18]  T. Salakoski,et al.  How Complex are Complex Protein-protein Interactions ? , 2008 .

[19]  Pieter Adriaans,et al.  A Local Alignment Kernel in the Context of NLP , 2008, COLING.

[20]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[21]  Udo Hahn,et al.  Event Extraction from Trimmed Dependency Graphs , 2009, BioNLP@HLT-NAACL.

[22]  Thomas Hofmann,et al.  Hidden Markov Support Vector Machines , 2003, ICML.