Extracting Protein Interactions from Text with the Unified AkaneRE Event Extraction System

Relation extraction (RE) and event extraction (EE) are currently the two main streams of biological information extraction. In 2009, the majority of RE and EE research efforts centered on the BioCreative II.5 Protein-Protein Interaction (PPI) challenge and the BioNLP event extraction shared task. Although these challenges took somewhat different approaches, they share the same ultimate goal: extracting biological knowledge from the literature. This paper compares the two challenge task definitions and presents a unified system that was successfully applied in both of these and in several other PPI extraction task settings. The AkaneRE system has three parts: a core engine for RE, a pool of modules for specific solutions, and a configuration language to adapt the system to different tasks. The core engine is based on machine learning, using either support vector machines or statistical classifiers together with features extracted from the given training data. The specific modules handle tasks such as sentence boundary detection, tokenization, stemming, part-of-speech tagging, parsing, named entity recognition, generation of potential relations, generation of machine learning features for each relation, and, finally, assignment of confidence scores and ranking of candidate relations. With these components, the AkaneRE system produces state-of-the-art results. The system is freely available for academic purposes at http://www-tsujii.is.s.u-tokyo.ac.jp/satre/akane/.
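The stages listed above (sentence boundary detection, tokenization, generation of potential relations, feature generation, and confidence scoring with ranking) can be illustrated with a minimal Python sketch. This is not AkaneRE's code or API: every function name, the hand-set `WEIGHTS` table, and the regular-expression splitter and tokenizer are assumptions made for illustration, and a fixed linear scorer stands in for the trained classifier described in the abstract.

```python
import itertools
import math
import re

# Hypothetical hand-set weights standing in for a trained linear model.
WEIGHTS = {
    "between=binds": 2.0,
    "between=interacts": 2.0,
    "between=not": -1.5,
    "distance": -0.1,
}


def split_sentences(text):
    """Naive sentence boundary detection: split after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Naive tokenization: runs of word characters."""
    return re.findall(r"\w+", sentence)


def candidate_pairs(tokens, proteins):
    """Generate every pair of protein mentions within one sentence."""
    positions = [i for i, tok in enumerate(tokens) if tok in proteins]
    return list(itertools.combinations(positions, 2))


def features(tokens, i, j):
    """Bag of words between the two mentions, plus their token distance."""
    feats = {f"between={w.lower()}": 1.0 for w in tokens[i + 1:j]}
    feats["distance"] = float(j - i)
    return feats


def score(feats):
    """Map the linear score to a confidence in (0, 1) with a sigmoid."""
    z = sum(WEIGHTS.get(name, 0.0) * value for name, value in feats.items())
    return 1.0 / (1.0 + math.exp(-z))


def extract(text, proteins):
    """Run all stages and return candidate relations ranked by confidence."""
    results = []
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        for i, j in candidate_pairs(tokens, proteins):
            confidence = score(features(tokens, i, j))
            results.append((tokens[i], tokens[j], confidence))
    return sorted(results, key=lambda r: r[2], reverse=True)


if __name__ == "__main__":
    text = "RAD51 binds BRCA2. TP53 was measured alongside MDM2."
    proteins = {"RAD51", "BRCA2", "TP53", "MDM2"}
    for a, b, conf in extract(text, proteins):
        print(f"{a} - {b}: {conf:.3f}")
```

In this toy setting the protein mentions are supplied as a set rather than found by named entity recognition, and the pair separated by "binds" ranks above the pair with no interaction cue. A real system would replace each stage with a proper module and learn the weights from annotated training data.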
