Unexpected features of the dark proteome

Significance

A key remaining frontier in our understanding of biological systems is the “dark proteome”, that is, the regions of proteins where molecular conformation is completely unknown. We systematically surveyed these regions, finding that nearly half of the proteome in eukaryotes is dark and that, surprisingly, most of the darkness cannot be accounted for. We also found that the dark proteome has unexpected features, including an association with secretory tissues, disulfide bonding, low evolutionary conservation, and very few known interactions with other proteins. This work will help future research shed light on the remaining dark proteome, thus revealing molecular processes of life that are currently unknown.

We surveyed the “dark” proteome, that is, regions of proteins never observed by experimental structure determination and inaccessible to homology modeling. For 546,000 Swiss-Prot proteins, we found that 44–54% of the proteome in eukaryotes and viruses was dark, compared with only ∼14% in archaea and bacteria. Surprisingly, most of the dark proteome could not be accounted for by conventional explanations, such as intrinsic disorder or transmembrane regions. Nearly half of the dark proteome comprised dark proteins, in which the entire sequence lacked similarity to any known structure. Dark proteins fulfill a wide variety of functions, but a subset showed distinct and largely unexpected features, such as association with secretion, specific tissues, the endoplasmic reticulum, disulfide bonding, and proteolytic cleavage. Dark proteins also had short sequence length, low evolutionary reuse, and few known interactions with other proteins. These results suggest new research directions in structural and computational biology.
