Genome Majority Vote Improves Gene Predictions

Recent studies have noted extensive inconsistencies in gene start sites among orthologous genes in related microbial genomes. Here we provide the first documented evidence that imposing gene start consistency improves the accuracy of gene start-site prediction. We applied an algorithm using a genome majority vote (GMV) scheme to increase the consistency of gene starts among orthologs. We used a set of validated Escherichia coli genes as a standard to quantify accuracy. Results showed that the GMV algorithm can correct hundreds of gene prediction errors in sets of five or ten genomes while introducing few errors. Using a conservative calculation, we project that GMV would resolve many inconsistencies and errors in publicly available microbial gene maps. Our simple and logical solution provides a notable advance toward accurate gene maps.
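
The majority-vote idea can be illustrated with a minimal sketch. This is not the published GMV implementation: the function name `majority_vote_starts`, the `min_fraction` threshold, and the input format (one predicted start per genome, given as a column in a multiple alignment of the ortholog group) are all assumptions made for illustration. A real pipeline would presumably also need to confirm that a valid start codon exists at the majority position before moving a gene start.

```python
from collections import Counter


def majority_vote_starts(aligned_starts, min_fraction=0.5):
    """Return revised start columns for one group of orthologs.

    aligned_starts: dict mapping a genome name to the alignment column of
        its predicted gene start (hypothetical input format).
    min_fraction: the most common start must be held by more than this
        fraction of genomes to count as a majority (assumed threshold).
    """
    counts = Counter(aligned_starts.values())
    column, votes = counts.most_common(1)[0]

    # Without a strict majority, leave every prediction unchanged.
    if votes / len(aligned_starts) <= min_fraction:
        return dict(aligned_starts)

    # Otherwise move every dissenting start to the majority column.
    return {genome: column for genome in aligned_starts}


# Hypothetical ortholog group of five genomes; genome D disagrees.
group = {"A": 12, "B": 12, "C": 12, "D": 30, "E": 12}
revised = majority_vote_starts(group)
```

In this toy group, four of five genomes agree on column 12, so the start in genome D is moved to match; a group with no strict majority is returned unchanged.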
