Finding the “Dark Matter” in Human and Yeast Protein Network Prediction and Modelling

Accurate modelling of biological systems requires deeper and more complete knowledge of the molecular components and their functional associations than we currently have. Traditionally, new knowledge on protein associations generated by experiments has played a central role in systems modelling, in contrast to generally less trusted bio-computational predictions. However, we will not achieve realistic modelling of complex molecular systems if the current experimental designs lead to biased screenings of real protein networks and leave large, functionally important areas poorly characterised. To assess the likelihood of this, we have built comprehensive network models of the yeast and human proteomes by using a meta-statistical integration of diverse computationally predicted protein association datasets. We have compared these predicted networks against combined experimental datasets from seven biological resources at different levels of statistical significance. These eukaryotic predicted networks resemble all the topological and noise features of the experimentally inferred networks in both species, and we also show that this observation is not due to random behaviour. In addition, the topology of the predicted networks contains information on true protein associations, beyond the constitutive first-order binary predictions. We also observe that most of the reliable predicted protein associations in our models are experimentally uncharacterised, constituting the hidden or “dark matter” of networks, by analogy to astronomical systems. Some of this dark matter shows enrichment of particular functions and contains key functional elements of protein networks, such as hubs associated with important functional areas like the regulation of Ras protein signal transduction in human cells. Thus, characterising this large and functionally important dark matter, elusive to established experimental designs, may be crucial for modelling biological systems. In any case, these predictions provide a valuable guide to these experimentally elusive regions.
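The abstract does not say which statistic underlies the "meta-statistical integration" of the predicted datasets. As an illustration only, the sketch below assumes Fisher's combined probability test, one classical way of merging independent p-values for the same protein pair into a single significance value. The function name and the example p-values are hypothetical and are not taken from the paper.

```python
import math


def fisher_combined_pvalue(p_values):
    """Combine independent p-values with Fisher's method (an assumed choice).

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution with
    2k degrees of freedom under the null hypothesis, where k = len(p_values).
    For even degrees of freedom the survival function has a closed form, so no
    external library is needed.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")

    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2

    # Survival function of chi-square with 2k d.o.f.:
    # exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Hypothetical p-values for one protein pair from three prediction methods.
pair_evidence = [0.04, 0.20, 0.01]
combined = fisher_combined_pvalue(pair_evidence)
print(f"combined p-value: {combined:.6f}")
```

Under this assumption, a pair would enter the predicted network when its combined p-value falls below a chosen significance threshold, and varying that threshold yields networks at different levels of statistical significance.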
