Parallel Cascade Recognition of Exon and Intron DNA Sequences

Abstract

Many current procedures for detecting coding regions in human DNA sequences combine several individual techniques, such as discriminant analysis and neural network methods. Recent papers have used techniques from nonlinear systems identification, in particular parallel cascade identification (PCI), as one means of classifying protein sequences into their structure/function groups. In the present paper, PCI is used in a pilot study to distinguish exon (coding) from intron (noncoding; interspersed within genes) human DNA sequences. Only the first exon and first intron sequences with known boundaries in genomic DNA from the β T-cell receptor locus were used for training. The resulting parallel cascade classifiers achieved classification rates of about 89% on novel sequences in a test set, and averaged about 82% when the results of a blind test were included. In testing over a much wider range of human nucleotide sequences, PCI classifiers averaged 83.6% correct classifications. These results indicate that parallel cascade classifiers may be useful components in future coding region detection programs. © 2002 Biomedical Engineering Society. PAC2002: 8715Cc, 8714Gg, 8715Aa
