Combining gene prediction methods to improve metagenomic gene annotation

Background: Traditional gene annotation methods rely on characteristics that may not be available in the short reads generated by next-generation sequencing technology, resulting in suboptimal performance on metagenomic (environmental) samples. In recent years, new programs have therefore been developed that optimize performance on short reads. In this work, we benchmark three metagenomic gene prediction programs and combine their predictions to improve gene annotation of metagenomic reads.

Results: Like similar studies, we analyze the programs' performance at different read lengths, but we also analyze different types of reads separately, including intra- and intergenic regions. The main deficiencies are in the algorithms' ability to predict non-coding regions and gene edges, resulting in more false positives and false negatives than desired. In fact, the specificities of the algorithms are notably worse than the sensitivities. By combining the programs' predictions, we show a significant improvement in specificity at minimal cost to sensitivity, resulting in a 4% improvement in accuracy for 100 bp reads and a ~1% improvement in accuracy for reads of 200 bp and above. To correctly annotate the start and stop of the genes, we find that a consensus of all the predictors performs best for shorter read lengths, while unanimous agreement is better for longer read lengths, boosting annotation accuracy by 1-8%. We also demonstrate the use of the classifier combinations on a real dataset.

Conclusions: To optimize both prediction and annotation accuracy, we conclude that the consensus of all methods (a majority vote) is best for reads of 400 bp and shorter, while the intersection of the GeneMark and Orphelia predictions is best for reads of 500 bp and longer. We demonstrate that most methods predict over 80% of reads to be coding (including partially coding) on a real human gut sample sequenced with Illumina technology.
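The combination rule stated in the conclusions can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each program's output has already been reduced to a per-read coding/non-coding call. The function names, the placeholder `third_predictor` (the abstract names only GeneMark and Orphelia), and the 400 bp cutoff are assumptions made for illustration; the abstract does not say which rule applies to reads between 400 and 500 bp.

```python
from typing import Dict, Iterable


def majority_vote(calls: Dict[str, bool]) -> bool:
    """Return True (coding) if more than half of the predictors call the read coding."""
    votes = sum(1 for is_coding in calls.values() if is_coding)
    return votes * 2 > len(calls)


def intersection(calls: Dict[str, bool], members: Iterable[str]) -> bool:
    """Return True (coding) only if every named predictor calls the read coding."""
    return all(calls[name] for name in members)


def combine(calls: Dict[str, bool], read_length: int) -> bool:
    """Pick the combination rule by read length.

    Assumption: reads longer than 400 bp use the intersection rule; the
    abstract only specifies '400 bp and shorter' and '500 bp and longer'.
    """
    if read_length <= 400:
        return majority_vote(calls)
    return intersection(calls, ("GeneMark", "Orphelia"))


if __name__ == "__main__":
    # Hypothetical per-read calls; "third_predictor" is a placeholder name.
    calls = {"GeneMark": True, "Orphelia": False, "third_predictor": True}
    print(combine(calls, read_length=100))  # majority vote over three predictors
    print(combine(calls, read_length=500))  # GeneMark AND Orphelia must agree
```

The majority vote tolerates one dissenting predictor, whereas the intersection demands agreement between the two named programs and so trades some sensitivity for specificity, in line with the behaviour the abstract reports for longer reads.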
