Objective: biochemical function

DNA sequencing enables the discovery of new genes in high-throughput, low-cost experiments. Gene function, in contrast, is determined by low-throughput, high-cost experiments. This mismatch in throughput is a major impediment to meeting one of the major scientific challenges of our time: the understanding of genomes. It is illustrated by the progress made on one of the earliest sequenced genomes, that of Mycobacterium tuberculosis H37Rv (Mtb). When its genome was published in 1998, more than a quarter of its genes had no known function (Cole et al., 1998). Our lack of knowledge about these approximately 1000 “conserved hypothetical” genes in Mtb represents a serious deficiency in our understanding of its biology. Now, after more than a decade of progress, our knowledge of those proteins' functions is essentially unchanged: more than 900 genes still have no known function (Lew et al., 2011). During this same period, the scientific community has sequenced approximately 18,000 new genomes (Pagani et al., 2012), containing millions of new hypothetical proteins. The vector of our progress appears to have tipped decisively away from data interpretation and comprehension, and toward mere data collection. To address the issue of gene function testing and annotation for all microbes, we founded COMBREX (COMputational BRidge to EXperiments), an endeavor aimed at accelerating the rate of gene function validation (Anton et al., 2013). Two of COMBREX's more prominent initiatives were the creation of a comprehensive database for protein function data (http://combrex.bu.edu) and the deployment of a crowdsourcing platform to catalyze protein function experimentation. In the course of these two efforts, it became apparent that fundamental changes in approaches to protein function determination were needed if there was to be any hope of keeping pace with DNA sequencing.
We suggest that the community work together to (1) re-establish the connection between existing gene annotation and the foundational experimental data that supports all annotation, (2) develop experiment design principles to help guide the identification of maximally informative targets for function validation, (3) invest in the development of higher-throughput approaches for the testing of protein function, and (4) provide an expedited publication pathway for reporting experimental results of gene function, analogous to the reporting of newly sequenced genomes in the journal “Standards in Genomic Sciences.”

[1] B. Cravatt, et al. Activity-based protein profiling: from enzyme chemistry to proteomic chemistry, 2008, Annual review of biochemistry.

[2] Patricia C. Babbitt, et al. Annotation Error in Public Databases: Misannotation of Molecular Function in Enzyme Superfamilies, 2009, PLoS Comput. Biol.

[3] Gary D Bader, et al. A draft map of the human proteome, 2014, Nature.

[4] Henry Lin, et al. Thousands of missed genes found in bacterial genomes and their analysis with COMBREX, 2012, Biology Direct.

[5] Stanley Letovsky, et al. Predicting protein function from protein/protein interaction data: a probabilistic approach, 2003, ISMB.

[6] B. Cravatt, et al. Activity-based Proteomics of Enzyme Superfamilies: Serine Hydrolases as a Case Study, 2010, The Journal of Biological Chemistry.

[7] Cheryl H Arrowsmith, et al. Enzyme genomics: Application of general enzymatic screens to discover new enzymes, 2005, FEMS microbiology reviews.

[8] Brittany J. Gasper, et al. Small World Initiative: crowdsourcing research of new antibiotics to enhance undergraduate biology teaching (618.41), 2014.

[9] Adamandia Kapopoulou, et al. TubercuList--10 years after, 2011, Tuberculosis.

[10] Michael Y. Galperin, et al. The COMBREX Project: Design, Methodology, and Initial Results, 2013, PLoS biology.

[11] Ian K. Blaby, et al. The archaeal COG1901/DUF358 SPOUT-methyltransferase members, together with pseudouridine synthase Pus10, catalyze the formation of 1-methylpseudouridine at position 54 of tRNA, 2012, RNA.

[12] B. Barrell, et al. Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence, 1998, Nature.

[13] V. de Crécy-Lagard, et al. Diversity of archaeosine synthesis in crenarchaeota, 2012, ACS chemical biology.

[14] D. Söll, et al. Selenomodification of tRNA in archaea requires a bipartite rhodanese enzyme, 2012, FEBS letters.

[15] Cheryl H Arrowsmith, et al. High throughput screening of purified proteins for enzymatic activity, 2008, Methods in molecular biology.

[16] R. Morgan, et al. Characterization of Type II and III Restriction-Modification Systems from Bacillus cereus Strains ATCC 10987 and ATCC 14579, 2011, Journal of bacteriology.

[17] Amos Bairoch, et al. Metrics for the Human Proteome Project 2013-2014 and strategies for finding missing proteins, 2014, Journal of proteome research.

[18] I-Min A. Chen, et al. The Genomes On Line Database (GOLD) in 2007: status of genomic and metagenomic projects and their associated metadata, 2007, Nucleic Acids Res.

[19] I-Min A. Chen, et al. The Genomes OnLine Database (GOLD) v.4: status of genomic and metagenomic projects and their associated metadata, 2011, Nucleic Acids Res.

[20] L. Columbus, et al. A broad specificity nucleoside kinase from Thermoplasma acidophilum, 2013, Proteins.

[21] Simon Kasif, et al. Biochemical Characterization of Hypothetical Proteins from Helicobacter pylori, 2013, PloS one.

[22] N. Grishin, et al. Tagaturonate-fructuronate epimerase UxaE, a novel enzyme in the hexuronate catabolic network in Thermotoga maritima, 2012, Environmental microbiology.

[23] G. Gadda, et al. A novel activity for fungal nitronate monooxygenase: detoxification of the metabolic inhibitor propionate-3-nitronate, 2012, Archives of biochemistry and biophysics.

[24] Richard J. Roberts, et al. Characterization of DNA methyltransferase specificities using single-molecule, real-time DNA sequencing, 2011, Nucleic acids research.