BMC Bioinformatics (BioMed Central), Methodology article
Gene prediction in metagenomic fragments: A large scale machine

Background

Metagenomics is an approach to the characterization of microbial genomes via the direct isolation of genomic sequences from the environment without prior cultivation. The amount of metagenomic sequence data is growing fast, while computational methods for metagenome analysis are still in their infancy. In contrast to genomic sequences of single species, which can usually be assembled and analyzed by many available methods, a large proportion of metagenome data remains as unassembled anonymous sequencing reads. One of the aims of all metagenomic sequencing projects is the identification of novel genes. The short length of the fragments (Sanger sequencing, for example, yields fragments of 700 bp on average) and the unknown phylogenetic origin of most fragments require approaches to gene prediction that differ from the methods currently available for genomes of single species. In particular, the large size of metagenomic samples requires fast and accurate methods with small numbers of false positive predictions.

Results

We introduce a novel gene prediction algorithm for metagenomic fragments based on a two-stage machine learning approach. In the first stage, we use linear discriminants for monocodon usage, dicodon usage and translation initiation sites to extract features from DNA sequences. In the second stage, an artificial neural network combines these features with open reading frame length and fragment GC-content to compute the probability that a given open reading frame encodes a protein. This probability is used for the classification and scoring of gene candidates. With large scale training, our method provides fast single fragment predictions with good sensitivity and specificity on artificially fragmented genomic DNA. Additionally, this method predicts translation initiation sites accurately and distinguishes complete from incomplete genes with high reliability.

Conclusion

Large scale machine learning methods are well-suited for gene prediction in metagenomic DNA fragments. In particular, the combination of linear discriminants and neural networks is promising and should be considered for integration into metagenomic analysis pipelines. The data sets can be downloaded from the URL provided (see Availability and requirements section).
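The two-stage idea described in the Results can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the single tanh hidden layer, and all weights are assumptions made for illustration only, whereas the real method learns its parameters from large-scale training data. Only the monocodon and GC-content features are computed here; dicodon and translation initiation site scores would enter the second stage in the same way.

```python
import math
from collections import Counter
from itertools import product

# All 64 codons in a fixed order, so that feature vectors are comparable.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def gc_content(seq):
    """Fraction of G and C bases in a fragment (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def monocodon_frequencies(orf):
    """Relative frequency of each codon in an in-frame ORF sequence."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3  # ignore a trailing partial codon
    codons = [orf[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODONS)  # skip ambiguous codons
    total = sum(counts.values())
    return [counts[c] / total if total else 0.0 for c in CODONS]


def linear_score(features, weights, bias=0.0):
    """Stage 1: project a feature vector onto one discriminant direction."""
    return sum(f * w for f, w in zip(features, weights)) + bias


def coding_probability(stage1_scores, orf_length, gc,
                       hidden_w, hidden_b, out_w, out_b):
    """Stage 2: a small feed-forward network (assumed architecture).

    Inputs are the stage 1 scores plus ORF length and fragment GC-content;
    the output is squashed to (0, 1) and read as a coding probability.
    """
    x = list(stage1_scores) + [orf_length, gc]
    hidden = [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
              for row, b in zip(hidden_w, hidden_b)]
    z = sum(w * h for w, h in zip(out_w, hidden)) + out_b
    return 1.0 / (1.0 + math.exp(-z))


# Usage with hypothetical, untrained weights.
fragment = "ATGGCTGCAAAAGGTTAA"
freqs = monocodon_frequencies(fragment)
score = linear_score(freqs, [0.1] * 64)
p = coding_probability([score], len(fragment), gc_content(fragment),
                       hidden_w=[[1.0, 0.01, 0.5], [-1.0, 0.02, 0.3]],
                       hidden_b=[0.0, 0.0], out_w=[0.8, -0.4], out_b=0.0)
```

Keeping the first stage linear means the high-dimensional codon statistics are reduced to a handful of scores, so the second-stage network in this sketch has only a few inputs to combine.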
