Hidden Markov Model for Splicing Junction Sites Identification in DNA Sequences

Identifying coding sequences in genomic DNA is a major step in gene identification. In eukaryotic organisms, gene structure comprises a promoter, a start codon, exons, introns, and a stop codon, among other elements, and identifying a gene requires accurate labeling of these segments. A splice site is the boundary between an exon and an intron. Although the sequences adjacent to splice sites are highly conserved, the accuracy of splice site prediction is generally below 90%. Because this accuracy is not yet satisfactory, much attention has been paid to improving it, and improving the underlying algorithms is essential. In this manuscript, a Hidden Markov Model (HMM) based splice site predictor is developed and trained using a Modified Expectation Maximization (MEM) algorithm. A 12-fold cross-validation technique is also applied to check the reproducibility of the results and to further increase the prediction accuracy. The proposed system achieves an accuracy of 98% for true donor sites and 93% for true acceptor sites in standard DNA (nucleotide) sequences.
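As an illustration of the 12-fold cross-validation mentioned above, the following is a minimal sketch of how a labeled set of sequences could be partitioned into 12 train/test splits. The function names `k_fold_indices` and `cross_validation_splits` are hypothetical and are not taken from the manuscript, and the contiguous, unshuffled partition is an assumption; the authors' actual splitting procedure is not described here.

```python
def k_fold_indices(n_samples, k=12):
    """Split range(n_samples) into k contiguous folds whose sizes differ by at most one.

    Hypothetical helper for illustration; the manuscript does not specify the split.
    """
    if not 1 <= k <= n_samples:
        raise ValueError("k must be between 1 and n_samples")
    base, extra = divmod(n_samples, k)
    folds = []
    start = 0
    for i in range(k):
        # The first `extra` folds receive one additional sample.
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validation_splits(n_samples, k=12):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    folds = k_fold_indices(n_samples, k)
    for i, test_idx in enumerate(folds):
        # Training set is every sample that is not in the held-out fold.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


if __name__ == "__main__":
    for fold_no, (train, test) in enumerate(cross_validation_splits(100, k=12), 1):
        print(f"fold {fold_no}: {len(train)} training, {len(test)} held-out sequences")
```

Each sequence is held out exactly once, so a predictor trained on the remaining folds is always evaluated on data it has not seen; averaging the per-fold accuracies gives a check on how reproducible the reported figures are.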
