Efficiently mining protein interaction dependencies from large text corpora.

Biochemical research has yielded extensive information about dependencies between protein interactions, which arise from allosteric regulation, steric hindrance and other mechanisms. Collectively, this information is valuable for understanding large intracellular protein networks. However, it is sparsely distributed among millions of publications and documented as free-form text meant for human readers. Here we develop a computational approach for extracting information about interaction dependencies from large numbers of publications. First, keyword-based tokenization reduces full papers to short strings, enabling an efficient search for patterns that are likely to indicate descriptions of interaction dependencies. Sentences that match such patterns are then extracted, reducing the amount of text that human curators must read. Applied to the integrin adhesome network, this approach extracted 208 short statements from 59,933 papers; close to half of these statements indeed describe interaction dependencies. We visualize the resulting hypernetwork of dependencies and illustrate that these dependencies constrain the feasible mechanisms of adhesion site assembly and generate testable hypotheses about their switchability.
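The two-step idea described above (reduce text to a short keyword string, then search that string for patterns) can be illustrated with a minimal sketch. It is not the authors' implementation: the keyword lists, the one-letter token alphabet (`P`, `I`, `D`, `o`), the two regular expressions and the function names `encode` and `candidate_sentences` are all assumptions chosen for illustration.

```python
import re

# Hypothetical keyword classes; the vocabularies actually used are not given here.
PROTEINS = {"talin", "vinculin", "paxillin", "fak", "integrin"}
INTERACTION = {"binds", "binding", "interacts", "interaction", "association"}
DEPENDENCY = {"inhibits", "enhances", "prevents", "requires", "promotes", "blocks"}

# Assumed patterns over the token string: two protein mentions, an interaction
# keyword and a dependency keyword, separated by at most three tokens each.
PATTERNS = [
    re.compile(r"P.{0,3}D.{0,3}I.{0,3}P"),
    re.compile(r"P.{0,3}I.{0,3}P.{0,3}D"),
]


def encode(sentence):
    """Reduce a sentence to one character per word: P, I, D, or o (other)."""
    chars = []
    for word in re.findall(r"[a-z0-9]+", sentence.lower()):
        if word in PROTEINS:
            chars.append("P")
        elif word in DEPENDENCY:
            chars.append("D")
        elif word in INTERACTION:
            chars.append("I")
        else:
            chars.append("o")
    return "".join(chars)


def candidate_sentences(text):
    """Return the sentences whose token string matches any assumed pattern."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences
            if any(p.search(encode(s)) for p in PATTERNS)]


if __name__ == "__main__":
    sample = (
        "Phosphorylation of paxillin by FAK enhances its binding to vinculin. "
        "Talin binds vinculin at focal adhesions. "
        "Talin binding to vinculin requires force."
    )
    for sentence in candidate_sentences(sample):
        print(sentence)
```

In this sketch the second sample sentence is skipped because it mentions an interaction but no dependency keyword. Searching the compact token string rather than the raw text keeps the patterns short, and the sentences returned would still need to be checked by a human curator.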
