GeneTack: Frameshift Identification in Protein-Coding Sequences by the Viterbi Algorithm

We describe a new program for ab initio frameshift detection in protein-coding nucleotide sequences. The task is to distinguish same-strand overlapping ORFs that arise from the presence of a frameshifted gene from same-strand overlapping ORFs that encompass true overlapping or adjacent genes. The GeneTack program uses a hidden Markov model (HMM) of a genomic sequence with possibly frameshifted protein-coding regions. The Viterbi algorithm finds the maximum likelihood path, which discriminates between true adjacent genes and adjacent protein-coding regions that only appear to be separate entities because of frameshifts. The program can therefore identify spurious predictions made by a conventional gene-finding program misled by a frameshift. We tested GeneTack, as well as two earlier programs, FrameD and FSFind, on 17 prokaryotic genomes with frameshifts introduced randomly into known genes. The average frameshift prediction accuracy of GeneTack, measured as (Sn + Sp)/2, was higher by a significant margin than that of the other two programs. In addition, the average accuracy of GeneTack compares favorably with that of the FSFind-BLAST program, which uses a protein database search to verify predicted frameshifts, even though GeneTack does not use external evidence. GeneTack is freely available at http://topaz.gatech.edu/GeneTack/.
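To illustrate how a Viterbi path can expose a frameshift, the following is a minimal sketch on a toy model, not GeneTack's actual HMM. It assumes three hidden states (one per reading frame), hypothetical nucleotide frequencies for each codon position (`CODON_POS_PROBS`), and a small hypothetical frame-switch probability (`SHIFT_PROB`). The real program models coding and non-coding regions in far more detail.

```python
import math

# Hypothetical nucleotide frequencies at codon positions 1, 2 and 3.
# These values are invented for illustration only.
CODON_POS_PROBS = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
SHIFT_PROB = 1e-3  # hypothetical probability of switching to another frame


def viterbi_frames(seq, shift_prob=SHIFT_PROB):
    """Return the most likely reading frame (0, 1 or 2) for each base."""
    frames = range(3)
    log_stay = math.log(1.0 - 2.0 * shift_prob)
    log_shift = math.log(shift_prob)

    def log_emit(frame, i):
        # In frame f, base i sits at codon position (i - f) mod 3.
        return math.log(CODON_POS_PROBS[(i - frame) % 3][seq[i]])

    def log_trans(prev, cur):
        return log_stay if prev == cur else log_shift

    # Initialization: uniform prior over the three frames.
    score = [math.log(1.0 / 3.0) + log_emit(f, 0) for f in frames]
    back = []  # back[i - 1][f] = best predecessor of frame f at base i

    # Recursion in log space to avoid numerical underflow.
    for i in range(1, len(seq)):
        new_score, pointers = [], []
        for f in frames:
            best_prev = max(frames, key=lambda p: score[p] + log_trans(p, f))
            new_score.append(
                score[best_prev] + log_trans(best_prev, f) + log_emit(f, i)
            )
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)

    # Traceback from the best final state.
    state = max(frames, key=lambda f: score[f])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def frameshift_positions(path):
    """Indices where the decoded frame differs from the previous base."""
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]


if __name__ == "__main__":
    # Ten ACG codons, one inserted base, then ten more ACG codons.
    shifted = "ACG" * 10 + "A" + "ACG" * 10
    print(frameshift_positions(viterbi_frames(shifted)))
```

On a sequence with a single inserted base, the decoded path changes frame once near the insertion, while an uninterrupted sequence yields no frame change. The small switch probability is the design choice that matters here: a frame change is accepted only when the downstream sequence fits the new frame much better than the old one.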
