An Integrative Method for Identifying the Over-Annotated Protein-Coding Genes in Microbial Genomes

Falsely annotated protein-coding genes are regarded as one of the major causes of annotation errors in public databases. Although many filtering approaches have been designed to remove over-annotated protein-coding genes, some are questionable because they increase the number of false negatives. Furthermore, no web server or software has been devised specifically for the over-annotation problem. In this study, we propose an integrative algorithm for detecting over-annotated protein-coding genes in microbial genomes. An average accuracy of 99.94% is achieved over 61 microbial genomes. This high accuracy indicates that the algorithm effectively differentiates protein-coding genes from non-coding open reading frames. Extensive analyses show that the predictions are reliable and that the integrative algorithm is robust and convenient. Our analysis also indicates that over-annotated protein-coding genes can cause false positives in the detection of horizontal gene transfer. The web server for the proposed algorithm is freely accessible at www.cbi.seu.edu.cn/RPGM.
