Automatic extraction of gene ontology annotation and its correlation with clusters in protein networks

Background: Uncovering the cellular roles of a protein is a task of tremendous importance and complexity that requires dedicated experimental work as well as often sophisticated data mining and processing tools. A protein's functions, often referred to as its annotations, are believed to manifest themselves through the topology of the networks of protein-protein interactions. In particular, there is a growing body of evidence that proteins performing the same function are more likely to interact with each other than with proteins with other functions. However, since functional annotation and protein network topology are often studied separately, the direct relationship between them has not been comprehensively demonstrated. In addition to its general biological significance, such a demonstration would further validate the data extraction and processing methods used to compose protein annotation and protein-protein interaction datasets.

Results: We developed a method for automatic extraction of protein functional annotation from scientific text based on Natural Language Processing (NLP) technology. For the protein annotation extracted from the entire PubMed, we evaluated the precision and recall rates and compared the performance of the automatic extraction technology to that of the manual curation used in public Gene Ontology (GO) annotation. In the second part of the study, we report a large-scale investigation into the correspondence between communities in the literature-based protein networks and GO annotation groups of functionally related proteins. We found a comprehensive two-way match: proteins within biological annotation groups form significantly more densely linked network clusters than expected by chance and, conversely, densely linked network communities exhibit a pronounced non-random overlap with GO groups. We also expanded the publicly available GO biological process annotation using the relations extracted by our NLP technology. An increase in the number and size of GO groups without any noticeable decrease in the link density within the groups indicated that this expansion significantly broadens the public GO annotation without diluting its quality. We found that functional GO annotation correlates mostly with clustering in a physical interaction protein network, while its overlap with indirect regulatory network communities is two to three times smaller.

Conclusion: Protein functional annotations extracted by the NLP technology expand and enrich the existing GO annotation system. The GO functional modularity correlates mostly with the clustering in the physical interaction network, suggesting an essential role for the structural organization maintained by these interactions. Reciprocally, clustering of proteins in physical interaction networks can serve as evidence of their functional similarity.
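The comparison of link density within an annotation group against chance, mentioned in the Results, can be illustrated with a minimal Python sketch. It is not the authors' procedure: the function names, the toy network, and the null model (random protein sets of the same size) are all assumptions made for illustration, and the study may well have used a different randomization scheme.

```python
import random
from itertools import combinations


def link_density(group, edges):
    """Fraction of all possible protein pairs in `group` that are linked."""
    members = sorted(set(group))
    n_pairs = len(members) * (len(members) - 1) // 2
    if n_pairs == 0:
        return 0.0
    linked = sum(
        1 for a, b in combinations(members, 2) if frozenset((a, b)) in edges
    )
    return linked / n_pairs


def empirical_p_value(group, edges, all_proteins, n_samples=1000, seed=0):
    """Estimate how often a random same-size protein set is at least as dense.

    Assumed null model: uniform random sampling of proteins without
    replacement. Returns (observed density, empirical p-value).
    """
    rng = random.Random(seed)
    observed = link_density(group, edges)
    size = len(set(group))
    hits = 0
    for _ in range(n_samples):
        sample = rng.sample(all_proteins, size)
        if link_density(sample, edges) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (hits + 1) / (n_samples + 1)


# Hypothetical toy network: 20 proteins, one fully linked group of four,
# plus a sparse chain connecting the remaining proteins.
all_proteins = [f"P{i}" for i in range(1, 21)]
toy_group = ["P1", "P2", "P3", "P4"]
edges = {frozenset(pair) for pair in combinations(toy_group, 2)}
edges |= {frozenset((f"P{i}", f"P{i + 1}")) for i in range(5, 20)}

observed_density, p_value = empirical_p_value(toy_group, edges, all_proteins)
print(f"density={observed_density:.2f}, empirical p={p_value:.4f}")
```

A small p-value here means that randomly chosen protein sets of the same size are rarely as densely linked as the annotation group. The reverse direction of the two-way match (overlap of network communities with GO groups) would need a separate test and is not sketched here.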
