Effective hidden Markov models for detecting splicing junction sites in DNA sequences

Abstract: Identifying or predicting coding sequences within genomic DNA has been a major rate-limiting step in the pursuit of genes. Currently available programs are far from powerful enough to elucidate gene structure completely. In this paper, we develop effective hidden Markov models (HMMs) to represent the consensus and degeneracy features of splicing junction sites in eukaryotic genes. The HMM system built on these models is fully trained using an expectation-maximization (EM) algorithm, and its performance is evaluated using a 10-way cross-validation method. Experimental results show that the proposed HMM system correctly detects 92% of the true donor sites and 91.5% of the true acceptor sites in a test data set containing real vertebrate gene sequences. These results suggest that our approach provides a useful tool for discovering splicing junction sites in eukaryotic genes.
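The 10-way cross-validation protocol mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `k_fold_indices` and `cross_validated_sensitivity` are hypothetical, and the `train` and `detect` callables are placeholders standing in for EM training of an HMM and for site detection, neither of which is specified in the abstract.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle item indices and deal them into k disjoint, near-equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_sensitivity(sites, train, detect, k=10, seed=0):
    """Fraction of held-out true sites that are detected, pooled over k folds.

    train(list_of_sites) -> model and detect(model, site) -> bool are
    assumed placeholders (e.g. for EM training and HMM-based detection).
    """
    folds = k_fold_indices(len(sites), k, seed)
    detected = 0
    for held_out in folds:
        held_out_set = set(held_out)
        # Train only on sites outside the current fold.
        training = [s for i, s in enumerate(sites) if i not in held_out_set]
        model = train(training)
        # Count held-out true sites that the trained model recovers.
        detected += sum(detect(model, sites[i]) for i in held_out)
    return detected / len(sites)
```

Each true site is held out exactly once, so the pooled fraction plays the role of a figure such as the reported 92% of donor sites detected, under the assumption that the reported rates are sensitivities on held-out data.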
