Efficient Extraction of Protein-Protein Interactions from Full-Text Articles

Proteins and their interactions govern virtually all cellular processes, such as regulation, signaling, metabolism, and structure. Most experimental findings on such interactions are reported in research papers, which are in turn curated by protein interaction databases. Authors, editors, and publishers benefit from efforts to ease the tasks of searching for relevant papers, for evidence of physical interactions, and for the proper identifier of each protein involved. The BioCreative II.5 community challenge addressed these tasks in a competition-style assessment to evaluate and compare different methodologies, to raise awareness of the increasing accuracy of automated methods, and to guide future implementations. In this paper, we present our approaches to protein named entity recognition, including normalization, and to the extraction of protein-protein interactions from full text. Our overall goal is to identify efficient individual components, and we compare various compositions that process a single full-text article in 10 seconds to 2 minutes. We propose strategies to transfer document-level annotations to the sentence level, which allows for the creation of a more fine-grained training corpus; we use this corpus to automatically derive around 5,000 patterns. We rank sentences by relevance to the task of finding novel interactions with physical evidence, using a sentence classifier built from this training corpus. Heuristics for paraphrasing sentences further remove unnecessary information that might interfere with patterns, such as additional adjectives, clauses, or bracketed expressions. In BioCreative II.5, we achieved an F-score of 22 percent for finding protein interactions and 43 percent for mapping proteins to UniProt IDs; disregarding species, the F-scores are 30 percent and 55 percent, respectively. On average, our best-performing setup required around 2 minutes per full text.
All data and pattern sets as well as Java classes that extend third-party software are available as supplementary information (see Appendix).
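
As a rough illustration of two ideas mentioned above, the following Python sketch shows one way bracketed expressions could be stripped from a sentence and one way document-level interaction annotations could be projected onto sentences. It is not the authors' implementation (their supplementary code is in Java). The function names `strip_bracketed` and `transfer_annotations`, the case-insensitive substring matching, and the example sentences are all assumptions made for this sketch.

```python
import re
from typing import List, Set, Tuple

Pair = Tuple[str, str]


def strip_bracketed(sentence: str) -> str:
    """Remove non-nested (...) and [...] spans, then collapse extra spaces.

    Hypothetical simplification heuristic; a real system would also need
    to handle nested brackets and brackets that carry protein names.
    """
    without = re.sub(r"\s*[\(\[][^()\[\]]*[\)\]]", "", sentence)
    return re.sub(r"\s{2,}", " ", without).strip()


def transfer_annotations(
    sentences: List[str], doc_pairs: Set[Pair]
) -> List[Tuple[str, Set[Pair]]]:
    """Attach to each sentence the document-level pairs it mentions.

    Assumption: a pair is projected onto a sentence when both names occur
    in it as case-insensitive substrings. Sentences with a non-empty set
    could serve as positive training examples, the rest as negatives.
    """
    labelled = []
    for sentence in sentences:
        lowered = sentence.lower()
        hits = {
            (a, b)
            for a, b in doc_pairs
            if a.lower() in lowered and b.lower() in lowered
        }
        labelled.append((sentence, hits))
    return labelled


if __name__ == "__main__":
    # Illustrative sentences, not taken from any corpus.
    doc = [
        "RAD51 (a recombinase) binds BRCA2 [12].",
        "BRCA2 was expressed in HeLa cells.",
    ]
    cleaned = [strip_bracketed(s) for s in doc]
    for sentence, pairs in transfer_annotations(cleaned, {("RAD51", "BRCA2")}):
        print(sentence, sorted(pairs))
```

Substring matching is chosen here only for brevity; matching against normalized identifiers, as produced by a gene normalization step, would avoid false hits on short or ambiguous names.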
