A hybrid approach to extract protein-protein interactions

MOTIVATION Protein-protein interactions (PPIs) play an important role in understanding biological processes. Although recent research in text mining has achieved significant progress in automatic PPI extraction from literature, the performance of existing systems still needs to be improved.

RESULTS In this study, we propose a novel two-phase algorithm for extracting PPIs from literature. First, we automatically categorize the data into subsets based on its semantic properties and extract candidate PPI pairs from these subsets. Second, we apply support vector machines (SVMs) to classify the candidate PPI pairs using features specific to each subset. We obtain promising results on five benchmark datasets (AIMed, BioInfer, HPRD50, IEPA and LLL), with F-scores ranging from 60% to 84%, which are comparable to those of state-of-the-art PPI extraction systems. Furthermore, our system achieves the best performance in cross-corpora evaluation and is competitive in terms of computational efficiency.

AVAILABILITY The source code and scripts used in this article are available for academic use at http://staff.science.uva.nl/~bui/PPIs.zip

CONTACT bqchinh@gmail.com
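The two-phase idea described in the abstract (partition candidate pairs into subsets, then classify each subset with its own model) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the keyword sets, the subset names (`verb`, `noun`, `other`) and the rule-based stand-in classifiers are all hypothetical, and in the paper the second phase uses SVMs trained on subset-specific features rather than hand-written rules.

```python
from collections import defaultdict
from itertools import combinations
from typing import Callable, Dict, Iterator, List, Set, Tuple

# Hypothetical keyword lists; the abstract does not give the actual lexicon.
INTERACTION_VERBS = {"binds", "interacts", "activates", "inhibits"}
INTERACTION_NOUNS = {"binding", "interaction", "activation", "inhibition"}

# (protein_a, protein_b, tokens between the two mentions)
Candidate = Tuple[str, str, List[str]]


def candidate_pairs(tokens: List[str], proteins: Set[str]) -> Iterator[Candidate]:
    """Phase 1a: enumerate every pair of protein mentions in one sentence."""
    positions = [i for i, tok in enumerate(tokens) if tok in proteins]
    for i, j in combinations(positions, 2):
        yield tokens[i], tokens[j], tokens[i + 1:j]


def categorize(candidate: Candidate) -> str:
    """Phase 1b: route a candidate to a subset (assumed categories)."""
    words = {w.lower() for w in candidate[2]}
    if words & INTERACTION_VERBS:
        return "verb"
    if words & INTERACTION_NOUNS:
        return "noun"
    return "other"


def partition(candidates: Iterator[Candidate]) -> Dict[str, List[Candidate]]:
    """Group candidates by the subset that categorize() assigns."""
    subsets: Dict[str, List[Candidate]] = defaultdict(list)
    for cand in candidates:
        subsets[categorize(cand)].append(cand)
    return dict(subsets)


def classify(
    subsets: Dict[str, List[Candidate]],
    classifiers: Dict[str, Callable[[List[str]], bool]],
) -> List[Tuple[str, str]]:
    """Phase 2: apply the subset-specific classifier to each candidate."""
    accepted = []
    for name, items in subsets.items():
        clf = classifiers[name]
        accepted.extend((a, b) for a, b, between in items if clf(between))
    return accepted


# Rule-based stand-ins for the per-subset SVMs (assumption, for illustration).
classifiers = {
    "verb": lambda between: "not" not in between,
    "noun": lambda between: len(between) <= 5,
    "other": lambda between: False,
}

tokens = "ProtA binds ProtB but not ProtC".split()
proteins = {"ProtA", "ProtB", "ProtC"}
predicted = classify(partition(candidate_pairs(tokens, proteins)), classifiers)
```

For the toy sentence above, `predicted` contains only the pair `("ProtA", "ProtB")`: the ProtA/ProtC candidate falls in the `verb` subset but is rejected by the negation rule, and the ProtB/ProtC candidate falls in `other`. Keeping the routing and the classifiers separate is what lets each subset use its own feature set, which is the point of the design the abstract describes.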
