GeneScout: a data mining system for predicting vertebrate genes in genomic DNA sequences

Automated detection or prediction of coding sequences within genomic DNA has been a major rate-limiting step in the pursuit of vertebrate genes. Currently available programs are far from powerful enough to elucidate a gene structure completely. In this paper, we present a new system, called GeneScout, for predicting gene structures in vertebrate genomic DNA. The system contains specially designed hidden Markov models (HMMs) for detecting functional sites, including protein translation start sites and mRNA splicing junction donor and acceptor sites. An HMM is also proposed for computing exon coding potential. Our main hypothesis is that, given a vertebrate genomic DNA sequence S, it is always possible to construct a directed acyclic graph G such that the path for the actual coding region of S is in the set of all paths on G. Thus, the gene detection problem is reduced to that of analyzing the paths in the graph G. A dynamic programming algorithm is used to find the optimal path in G. The proposed system is trained using an expectation-maximization algorithm, and its performance on vertebrate gene prediction is evaluated using 10-way cross-validation. Experimental results show that the proposed system performs well and is comparable to existing gene discovery tools.
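To make the reduction to path analysis concrete, the following is a minimal sketch of finding an optimal path in a directed acyclic graph by dynamic programming. It is not GeneScout's implementation: the function name `best_path`, the integer node numbering, and the additive edge scores are all assumptions made for illustration. Nodes are assumed to stand for candidate functional sites sorted by sequence position, so node indices already form a topological order.

```python
from collections import defaultdict


def best_path(num_nodes, edges, source, sink):
    """Return (score, path) for the highest-scoring source-to-sink path.

    Assumptions (illustrative, not taken from the paper):
    - nodes are integers 0..num_nodes-1, numbered in topological order,
      e.g. candidate sites sorted by position in the sequence;
    - each edge is a tuple (u, v, weight) with u < v, where weight is a
      hypothetical additive score for the segment between the two sites.
    """
    outgoing = defaultdict(list)
    for u, v, w in edges:
        if u >= v:
            raise ValueError("edges must go from a lower to a higher index")
        outgoing[u].append((v, w))

    neg_inf = float("-inf")
    score = [neg_inf] * num_nodes   # best score of any path source -> node
    back = [None] * num_nodes       # predecessor on that best path
    score[source] = 0.0

    # Visit nodes in topological order and relax every outgoing edge once.
    for u in range(num_nodes):
        if score[u] == neg_inf:
            continue                # node is not reachable from the source
        for v, w in outgoing[u]:
            if score[u] + w > score[v]:
                score[v] = score[u] + w
                back[v] = u

    if score[sink] == neg_inf:
        return neg_inf, []          # no path connects source and sink

    # Trace predecessors back from the sink to recover the path.
    path = [sink]
    while path[-1] != source:
        path.append(back[path[-1]])
    path.reverse()
    return score[sink], path


if __name__ == "__main__":
    # Toy graph: 0 = start site, 1 and 2 = alternative donor sites,
    # 3 = acceptor site, 4 = stop. The scores are made up.
    toy_edges = [
        (0, 1, 2.0), (0, 2, 1.0),
        (1, 3, -0.5), (2, 3, 1.5),
        (1, 4, 0.2), (3, 4, 1.0),
    ]
    print(best_path(5, toy_edges, 0, 4))
```

Because each edge is relaxed exactly once, the running time is linear in the number of nodes plus edges. In the toy graph the path through node 2 is selected even though the first edge into node 1 scores higher, which is why the whole path, not each site in isolation, has to be scored.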
