HMM-FRAME: accurate protein domain classification for metagenomic sequences containing frameshift errors

Background: Protein domain classification is an important step in metagenomic annotation. The state-of-the-art method for protein domain classification is profile HMM-based alignment. However, the relatively high rates of insertions and deletions in homopolymer regions of pyrosequencing reads create frameshifts, causing conventional profile HMM alignment tools to generate alignments with marginal scores. This makes error-containing gene fragments unclassifiable with conventional tools. Thus, there is a need for an accurate domain classification tool that can detect and correct sequencing errors.

Results: We introduce HMM-FRAME, a protein domain classification tool based on an augmented Viterbi algorithm that can incorporate error models from different sequencing platforms. HMM-FRAME corrects sequencing errors and classifies putative gene fragments into domain families. It achieved high error detection sensitivity and specificity on a data set with annotated errors. We applied HMM-FRAME to Targeted Metagenomics data and to a published metagenomic data set. The results showed that our tool can correct frameshifts in error-containing sequences, generate much longer alignments with significantly smaller E-values, and classify more sequences into their native families.

Conclusions: HMM-FRAME provides a complementary protein domain classification tool to conventional profile HMM-based methods for data sets containing frameshifts. Its current implementation is best used for small-scale metagenomic data sets. The source code of HMM-FRAME can be downloaded at http://www.cse.msu.edu/~zhangy72/hmmframe/ and at https://sourceforge.net/projects/hmm-frame/.
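
The frameshift problem described above can be illustrated with a small sketch. The code below is not HMM-FRAME's augmented Viterbi algorithm: as a simplifying assumption it aligns a DNA read against a single protein sequence rather than a profile HMM, and the function names and the scores (`match`, `mismatch`, `frameshift`) are hypothetical values chosen only for illustration. It shows how one extra base in a homopolymer run corrupts a naive translation, and how a dynamic program that allows 1-base skips (insertion errors) and 2-base codons (deletion errors) can recover the original frame.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
NEG_INF = float("-inf")


def translate(dna):
    """Naive frame-0 translation; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def frame_aware_align(dna, protein, match=2, mismatch=-1, frameshift=-3):
    """Toy global DNA-to-protein alignment that tolerates frameshifts.

    score[i][j] is the best score after consuming i nucleotides and
    j amino acids. Returns (best score, list of (error kind, read position)).
    """
    n, m = len(dna), len(protein)
    score = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            s = score[i][j]
            if s == NEG_INF:
                continue
            moves = []
            if i + 3 <= n and j < m:  # regular codon
                aa = CODON_TABLE[dna[i:i + 3]]
                delta = match if aa == protein[j] else mismatch
                moves.append((i + 3, j + 1, delta, "codon"))
            if i + 1 <= n:  # extra base in the read: skip it
                moves.append((i + 1, j, frameshift, "insertion"))
            if i + 2 <= n and j < m:  # missing base: 2-base codon
                moves.append((i + 2, j + 1, frameshift, "deletion"))
            for ni, nj, delta, kind in moves:
                if s + delta > score[ni][nj]:
                    score[ni][nj] = s + delta
                    back[ni][nj] = (i, j, kind)
    if score[n][m] == NEG_INF:
        raise ValueError("read is too short to cover the protein")
    # Trace back from the end and collect the frameshift events.
    events, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj, kind = back[i][j]
        if kind != "codon":
            events.append((kind, pi))
        i, j = pi, pj
    return score[n][m], events[::-1]


protein = "MKKKLA"
clean_read = "ATG" + "A" * 9 + "CTGGCT"    # encodes MKKKLA
noisy_read = "ATG" + "A" * 10 + "CTGGCT"   # one extra A in the homopolymer

naive = translate(noisy_read)              # frame is lost after the A run
best_score, errors = frame_aware_align(noisy_read, protein)
```

With these hypothetical scores, the naive translation of `noisy_read` diverges from `protein` after the homopolymer run, while the frame-aware alignment reports a single insertion inside the run. Because every A in the run is equivalent, the exact position reported for the extra base is arbitrary within the run.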
