Integrating protein-protein interactions and text mining for protein function prediction

Background: Functional annotation of proteins remains a challenging task. Currently the scientific literature serves as the main source of functional annotations that have not yet been curated, but curation work is slow and expensive. Automatic techniques that support this work still lack reliability. We developed a method to identify conserved protein interaction graphs and to predict missing protein functions from orthologs in these graphs. To enhance the precision of the results, we furthermore implemented a procedure that validates all predictions based on findings reported in the literature.

Results: Using this procedure, more than 80% of the GO annotations available in UniProtKB/Swiss-Prot for proteins with highly conserved orthologs could be verified automatically. For a subset of proteins we predicted new GO annotations that were not available in UniProtKB/Swiss-Prot. All predictions were correct (100% precision) according to verification by a trained curator.

Conclusion: Our method of integrating CCSs and literature mining is thus a highly reliable approach to predict GO annotations for weakly characterized proteins with orthologs.
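The two steps described in the abstract, transferring annotations between orthologs and then keeping only predictions supported by the literature, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the representation of ortholog groups as sets of protein identifiers, the toy protein names, and the `literature_hits` set of (protein, GO term) pairs are all assumptions made for illustration. The sketch also skips the detection of conserved interaction graphs and starts from ortholog groups that are assumed to be given.

```python
from collections import defaultdict


def predict_annotations(ortholog_groups, annotations):
    """Propose, for each protein, GO terms that an ortholog has but it lacks.

    ortholog_groups: iterable of sets of protein IDs (hypothetical input format).
    annotations: dict mapping protein ID -> set of known GO terms.
    """
    predictions = defaultdict(set)
    for group in ortholog_groups:
        for protein in group:
            known = annotations.get(protein, set())
            for other in group:
                if other != protein:
                    # Transfer only terms the protein does not already carry.
                    predictions[protein] |= annotations.get(other, set()) - known
    # Drop proteins for which nothing new was proposed.
    return {p: terms for p, terms in predictions.items() if terms}


def validate_with_literature(predictions, literature_hits):
    """Keep only predictions whose (protein, GO term) pair was found in text.

    literature_hits: set of (protein ID, GO term) pairs, assumed to come from
    a separate text-mining step that is not modelled here.
    """
    validated = {}
    for protein, terms in predictions.items():
        supported = {t for t in terms if (protein, t) in literature_hits}
        if supported:
            validated[protein] = supported
    return validated


# Toy data, invented for illustration only.
groups = [{"yeastA", "humanA"}]
known = {
    "yeastA": {"GO:0006281"},
    "humanA": {"GO:0006281", "GO:0005634"},
}
predicted = predict_annotations(groups, known)
confirmed = validate_with_literature(predicted, {("yeastA", "GO:0005634")})
```

In this toy example, `yeastA` receives the one term that its ortholog `humanA` carries and it lacks, and the prediction survives validation because the pair appears in `literature_hits`. Filtering on literature support trades recall for precision, which matches the emphasis of the abstract.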
