Splice site prediction using stochastic regular grammars.

This paper presents a novel approach to the problem of splice site prediction, by applying stochastic grammar inference. We used four grammar inference algorithms to infer 1465 grammars, and used 10-fold cross-validation to select the best grammar for each algorithm. The corresponding grammars were embedded into a classifier and used to run splice site prediction and compare the results with those of NNSPLICE, the predictor used by the Genie gene finder. We indicate possible paths to improve this performance by using Sakakibara's windowing technique to find probability thresholds that will lower false-positive predictions.

[1]  R Staden Computer methods to locate signals in nucleic acid sequences , 1984, Nucleic Acids Res..

[2]  Dana Ron,et al.  The power of amnesia: Learning probabilistic automata with variable memory length , 1996, Machine Learning.

[3]  Anders Krogh,et al.  Hidden Markov models for sequence analysis: extension and analysis of the basic method , 1996, Comput. Appl. Biosci..

[4]  Anders Krogh,et al.  Two Methods for Improving Performance of a HMM and their Application for Gene Finding , 1997, ISMB.

[5]  Anders Krogh,et al.  Neural Network Ensembles, Cross Validation, and Active Learning , 1994, NIPS.

[6]  Alfred V. Aho,et al.  Compilers: Principles, Techniques, and Tools , 1986, Addison-Wesley series in computer science / World student series edition.

[7]  Naoki Abe,et al.  On the Computational Complexity of Approximating Distributions by Probabilistic Automata , 1990, Annual Conference Computational Learning Theory.

[8]  D. Haussler,et al.  Stochastic context-free grammars for modeling RNA , 1993, 1994 Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences.

[9]  E. Uberbacher,et al.  Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach. , 1991, Proceedings of the National Academy of Sciences of the United States of America.

[10]  Dana Ron,et al.  On the learnability and usage of acyclic probabilistic finite automata , 1995, COLT '95.

[11]  John Langford,et al.  Beating the hold-out: bounds for K-fold and progressive cross-validation , 1999, COLT '99.

[12]  Pierre Baldi,et al.  Bioinformatics - the machine learning approach (2. ed.) , 2000 .

[13]  Naoki Abe,et al.  On the computational complexity of approximating distributions by probabilistic automata , 1990, Machine Learning.

[14]  V. Solovyev,et al.  Ab initio gene finding in Drosophila genomic DNA. , 2000, Genome research.

[15]  S. Karlin,et al.  Prediction of complete gene structures in human genomic DNA. , 1997, Journal of molecular biology.

[16]  José Oncina,et al.  Learning Stochastic Regular Grammars by Means of a State Merging Method , 1994, ICGI.

[17]  Thomas G. Dietterich Overfitting and undercomputing in machine learning , 1995, CSUR.

[18]  S. Knudsen,et al.  Prediction of human mRNA donor and acceptor sites from the DNA sequence. , 1991, Journal of molecular biology.

[19]  J. Oncina,et al.  INFERRING REGULAR LANGUAGES IN POLYNOMIAL UPDATED TIME , 1992 .

[20]  Liaofu Luo,et al.  Splice site prediction with quadratic discriminant analysis using diversity measure. , 2003, Nucleic acids research.

[21]  Gustavo A. Stolovitzky,et al.  Bioinformatics: The Machine Learning Approach , 2002 .

[22]  Michael Q. Zhang,et al.  A weight array method for splicing signal analysis , 1993, Comput. Appl. Biosci..

[23]  R. Durbin,et al.  RNA sequence analysis using covariance models. , 1994, Nucleic acids research.

[24]  Temple F. Smith,et al.  Prediction of gene structure. , 1992, Journal of molecular biology.

[25]  David Haussler,et al.  Improved splice site detection in Genie , 1997, RECOMB '97.

[26]  P. Rouzé,et al.  Current methods of gene prediction, their strengths and weaknesses. , 2002, Nucleic acids research.

[27]  Simon Kasif,et al.  A System for Induction of Oblique Decision Trees , 1994, J. Artif. Intell. Res..