A Protein Domain Co-Occurrence Network Approach for Predicting Protein Function and Inferring Species Phylogeny

The protein domain co-occurrence network (DCN) is a biological network that has not been fully studied. We analyzed the properties of the DCNs of H. sapiens, S. cerevisiae, C. elegans, D. melanogaster, and 15 plant genomes. These DCNs have the hallmark features of scale-free networks. We investigated the possibility of using DCNs to predict protein and domain functions. In an experiment on 66 randomly selected proteins, the best of the top three predictions made by our DCN-based aggregated neighbor-counting method achieved a semantic similarity score of 0.81 to the proteins' actual Gene Ontology terms. Moreover, when used to predict specific Gene Ontology terms of human target domains, the top three predictions of the neighbor-counting, χ², and SVM-based methods achieved accuracies of 66%, 59%, and 61%, respectively. On average, these predictions had semantic similarity scores of 0.82, 0.80, and 0.79 to the actual Gene Ontology terms, respectively. We also used DCNs to predict whether a domain is an enzyme domain; our SVM-based and neighbor-inference methods correctly classified 79% and 77% of the target domains, respectively. When using DCNs to classify a target domain into one of the six enzyme classes, we found that, as long as at least one EC number is available among the neighboring domains, our SVM-based and neighbor-counting methods correctly classified 92.4% and 91.9% of the target domains, respectively. Furthermore, we benchmarked the performance of DCNs for inferring species phylogenies on six different combinations of 398 single-chromosome prokaryotic genomes. The phylogenetic tree of 54 prokaryotic taxa generated by our DCN-alignment-based method achieved a similarity score of 93.45% when compared with Bergey's taxonomy. In summary, our studies show that genome-wide DCNs contain rich information that can be used effectively to decipher protein function and to reveal the evolutionary relationships among species.
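The abstract does not spell out how a DCN is built or how neighbor counting ranks terms, so the following is only a minimal sketch under stated assumptions: two domains are linked when they occur in the same protein, and a target domain's candidate terms are ranked by how many of its neighbors carry them. The function names (`build_dcn`, `predict_terms`), the protein and domain identifiers, and the term labels are all hypothetical placeholders, not data or code from the study.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_dcn(protein_domains):
    """Link two domains whenever they occur together in at least one protein."""
    network = defaultdict(set)
    for domains in protein_domains.values():
        for a, b in combinations(sorted(set(domains)), 2):
            network[a].add(b)
            network[b].add(a)
    return network


def predict_terms(network, annotations, target, top_k=3):
    """Rank terms by the number of the target's neighbors annotated with them."""
    counts = Counter()
    for neighbor in network.get(target, ()):
        counts.update(annotations.get(neighbor, ()))
    # Sort by descending count, then by term label so ties break deterministically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:top_k]]


# Hypothetical toy input: protein -> domains it contains.
protein_domains = {
    "P1": ["D1", "D2"],
    "P2": ["D1", "D3"],
    "P3": ["D1", "D4"],
    "P4": ["D2", "D3"],
}
# Hypothetical known annotations for the neighboring domains.
annotations = {
    "D2": {"GO:A", "GO:B"},
    "D3": {"GO:A"},
    "D4": {"GO:C"},
}

network = build_dcn(protein_domains)
predictions = predict_terms(network, annotations, "D1")
print(predictions)
```

In this toy network the unannotated domain `D1` has three neighbors, and the term shared by two of them is ranked first. The χ² and SVM-based variants mentioned in the abstract would replace the raw count with a different scoring step and are not sketched here.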
