Paper information - Using a VOM model for reconstructing potential coding regions in EST sequences

This paper presents a method for annotating coding and noncoding DNA regions using variable order Markov (VOM) models. A main advantage of VOM models is that their order can vary from one sequence to another, depending on the sequences’ statistics. As a result, VOM models are more flexible in their parameterization and can be trained on relatively short sequences and on low-quality datasets, such as expressed sequence tags (ESTs). The paper also presents a modified VOM model for detecting and correcting the insertion and deletion sequencing errors that are commonly found in ESTs. In a series of experiments, the proposed method is found to be robust to random errors in these sequences.
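To make the idea of a variable-order context concrete, the following is a minimal sketch, not the paper's algorithm. It assumes a simple count-based pruning rule (a context is kept only if it occurs at least `min_count` times) and add-one smoothing; the names `train_vom`, `log_likelihood`, `max_order`, and `min_count` are hypothetical. It does not attempt the insertion and deletion correction described above.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumed DNA alphabet


def train_vom(sequences, max_order=3, min_count=2):
    """Collect next-symbol counts for every context of length 0..max_order.

    Illustrative pruning rule (an assumption, not the paper's criterion):
    a context is kept only if it was seen at least min_count times.
    The empty context is always kept as a fallback.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, symbol in enumerate(seq):
            for k in range(min(i, max_order) + 1):
                context = seq[i - k:i]  # the k symbols preceding position i
                counts[context][symbol] += 1
    model = {
        ctx: dict(nxt)
        for ctx, nxt in counts.items()
        if ctx == "" or sum(nxt.values()) >= min_count
    }
    model.setdefault("", {})  # guarantee the fallback context exists
    return model


def log_likelihood(model, seq, max_order=3):
    """Score seq using, at each position, the longest context kept in the model."""
    total = 0.0
    for i, symbol in enumerate(seq):
        context = ""
        for k in range(min(i, max_order), -1, -1):
            candidate = seq[i - k:i]
            if candidate in model:
                context = candidate
                break
        nxt = model[context]
        n = sum(nxt.values())
        # Add-one smoothing so unseen symbols never get probability zero.
        p = (nxt.get(symbol, 0) + 1) / (n + len(ALPHABET))
        total += math.log(p)
    return total


# Toy usage: two models trained on made-up "coding" and "noncoding" strings.
coding_model = train_vom(["ATGGCC" * 4])
noncoding_model = train_vom(["AT" * 12])
query = "GGCCATGG"
is_coding_like = (
    log_likelihood(coding_model, query) > log_likelihood(noncoding_model, query)
)
```

Because the effective context length is chosen per position, frequent contexts use a long memory while rare ones fall back to shorter contexts, which is one way to see why such models can cope with short or noisy training sequences.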

Armin Shmilovici | Irad Ben-Gal
