Solving the Problem: Genome Annotation Standards before the Data Deluge

The promise of genome sequencing was that comparison of the multitude of available sequences would map out the vast undiscovered country and aid researchers in deciphering the role of each gene in every organism. Researchers recognize the need for high-quality data. However, differing annotation procedures, numerous databases, and a diminishing percentage of experimentally determined gene functions have resulted in a spectrum of annotation quality. NCBI, in collaboration with sequencing centers, archival databases, and researchers, has developed the first international annotation standards, a fundamental step in ensuring that high-quality complete prokaryotic genomes are available as gold-standard references. Highlights include the development of annotation assessment tools, community acceptance of protein naming standards, comparison of annotation resources to provide consistent annotation, and improved tracking of the evidence used to generate a particular annotation. The development of a set of minimal standards, including the requirement that annotated complete prokaryotic genomes contain a full set of ribosomal RNAs, transfer RNAs, and proteins encoding core conserved functions, is a historic milestone. Applying these standards to existing genomes and future submissions will increase the quality of databases, enabling researchers to make accurate biological discoveries.
