DNA Sequence Classification Using DAWGs

DNA sequence classification involves attributing substrings, or words, within a sequence to known distinct sequence classes. A query sequence was classified by comparing all of its words to words in databases representative of three classes of DNA: transcriptional promoters, exons and introns. The efficiency of this comparison was increased by constructing directed acyclic word graphs (DAWGs) of all sequences and databases. The resulting landscape was scored to determine the preference of words in the query sequence for any one particular database class. Using this approach, it was possible to detect 94% of a test set of individual promoter sequences, with only 4% of test exon sequences incorrectly detected as promoters. Preliminary attempts were made to parse genomic DNA into promoter, exon and intron regions. Initial results indicate that a reasonably high degree of correlation exists between the predicted regions and known promoter-exon-intron domains.
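The word comparison described above can be illustrated with a minimal sketch. It builds a suffix automaton (the DAWG of a text) for each class database, then records, for every position in the query, the length of the longest database word ending there. Everything specific here is an assumption made for illustration and is not taken from the paper: the names `SuffixAutomaton`, `longest_match_lengths` and `classify`, the `#` separator used to join database sequences, the toy sequences, and in particular the scoring rule (mean longest-match length), which only stands in for whatever landscape scoring the authors used.

```python
class SuffixAutomaton:
    """DAWG (suffix automaton) recognising every substring of `text`."""

    def __init__(self, text):
        self.next = [{}]      # transitions of each state
        self.link = [-1]      # suffix link of each state
        self.length = [0]     # longest word length reaching each state
        last = 0
        for ch in text:
            cur = self._new_state(self.length[last] + 1)
            p = last
            while p != -1 and ch not in self.next[p]:
                self.next[p][ch] = cur
                p = self.link[p]
            if p == -1:
                self.link[cur] = 0
            else:
                q = self.next[p][ch]
                if self.length[p] + 1 == self.length[q]:
                    self.link[cur] = q
                else:
                    # Split q: the clone keeps the shorter words.
                    clone = self._new_state(self.length[p] + 1)
                    self.next[clone] = dict(self.next[q])
                    self.link[clone] = self.link[q]
                    while p != -1 and self.next[p].get(ch) == q:
                        self.next[p][ch] = clone
                        p = self.link[p]
                    self.link[q] = clone
                    self.link[cur] = clone
            last = cur

    def _new_state(self, length):
        self.next.append({})
        self.link.append(-1)
        self.length.append(length)
        return len(self.length) - 1

    def longest_match_lengths(self, query):
        """For each query position, length of the longest substring of
        the indexed text that ends at that position."""
        state, matched, out = 0, 0, []
        for ch in query:
            # Fall back along suffix links until `ch` can be followed.
            while state != 0 and ch not in self.next[state]:
                state = self.link[state]
                matched = self.length[state]
            if ch in self.next[state]:
                state = self.next[state][ch]
                matched += 1
            else:
                matched = 0
            out.append(matched)
        return out


def classify(query, databases):
    """Return (best_class, scores). The score is the mean longest-match
    length, an assumed stand-in for the paper's landscape scoring."""
    scores = {}
    for name, sequences in databases.items():
        # '#' is assumed absent from DNA, so matches cannot span sequences.
        dawg = SuffixAutomaton("#".join(sequences))
        lengths = dawg.longest_match_lengths(query)
        scores[name] = sum(lengths) / max(len(query), 1)
    return max(scores, key=scores.get), scores


# Toy, made-up databases for illustration only.
toy_databases = {
    "promoter": ["TATAAAAG", "GGCCAATCT"],
    "exon": ["ATGGCCAAG"],
    "intron": ["GTAAGTTTTTCAG"],
}
label, scores = classify("TATAAA", toy_databases)
print(label, scores)
```

Each automaton is built in time linear in the database size, and the query is scanned once per class, which is the kind of efficiency gain the DAWG construction is meant to provide over comparing every query word against every database word directly.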
