Using a VOM model for reconstructing potential coding regions in EST sequences

This paper presents a method for annotating coding and noncoding DNA regions using variable order Markov (VOM) models. A main advantage of VOM models is that their order can vary across sequences, depending on each sequence's statistics. As a result, VOM models are more flexible in their parameterization and can be trained on relatively short sequences and on low-quality datasets, such as expressed sequence tags (ESTs). The paper also presents a modified VOM model for detecting and correcting the insertion and deletion sequencing errors that are commonly found in ESTs. In a series of experiments, the proposed method is found to be robust to random errors in these sequences.
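To make the idea of a variable order concrete, the following is a minimal sketch of a generic VOM-style predictor, not the model proposed in the paper. It assumes a simple rule: a context is kept only if it was observed at least `min_count` times in training, and prediction backs off to the longest retained context. All function names, the `min_count` threshold, and the add-one smoothing are illustrative assumptions.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumed DNA alphabet


def train_vom(seq, max_order=3, min_count=2):
    """Count next-symbol occurrences for every context of length 0..max_order."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, sym in enumerate(seq):
        for k in range(min(max_order, i) + 1):
            counts[seq[i - k:i]][sym] += 1
    # Keep only contexts seen often enough (illustrative pruning rule);
    # the empty context is always kept as the final fallback.
    return {ctx: dict(nxt) for ctx, nxt in counts.items()
            if ctx == "" or sum(nxt.values()) >= min_count}


def longest_context(model, history, max_order=3):
    """Return the longest suffix of `history` that is a retained context."""
    for k in range(min(max_order, len(history)), -1, -1):
        ctx = history[len(history) - k:]
        if ctx in model:
            return ctx
    return ""


def prob(model, history, sym, max_order=3):
    """P(sym | history) from the longest retained context, add-one smoothed."""
    nxt = model.get(longest_context(model, history, max_order), {})
    total = sum(nxt.values())
    return (nxt.get(sym, 0) + 1) / (total + len(ALPHABET))


def log_likelihood(model, seq, max_order=3):
    """Log2-likelihood of `seq`; higher means a better fit to the model."""
    return sum(math.log2(prob(model, seq[:i], seq[i], max_order))
               for i in range(len(seq)))


model = train_vom("ACGACGACGACG", max_order=2, min_count=2)
print(longest_context(model, "AC", 2))   # a context seen in training
print(longest_context(model, "TT", 2))   # unseen, so it backs off to ""
print(log_likelihood(model, "ACGACG", 2), log_likelihood(model, "AGCAGC", 2))
```

Because the effective order is chosen per context, rarely seen contexts fall back to shorter ones instead of producing unreliable estimates, which is the property that makes such models usable on short or noisy training sequences.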
