Classification of short human exons and introns based on statistical features.

The classification of human gene sequences into exons and introns is a difficult problem in DNA sequence analysis. In this paper, we define a set of features, called the simple Z (SZ) features, which is derived from the Z-curve features for the recognition of human exons and introns. The classification results show that SZ features, while fewer in numbers (three in total), can preserve the high recognition rate of the original nine Z-curve features. Since the size of SZ features is one-third of the Z-curve features, the dimensionality of the feature space is much smaller, and better recognition efficiency is achieved. If the stop codon feature is used together with the three SZ features, a recognition rate of up to 92% for short sequences of length <140 bp can be obtained.

[1]  C. Zhang,et al.  Identification of protein-coding genes in the genome of Vibrio cholerae with more than 98% accuracy using occurrence frequencies of single nucleotides. , 2001, European journal of biochemistry.

[2]  R Zhang,et al.  Z curves, an intutive tool for visualizing and analyzing the DNA sequences. , 1994, Journal of biomolecular structure & dynamics.

[3]  R Staden Computer methods to locate signals in nucleic acid sequences , 1984, Nucleic Acids Res..

[4]  S. Salzberg,et al.  Microbial gene identification using interpolated Markov models. , 1998, Nucleic acids research.

[5]  J. Fickett,et al.  Assessment of protein coding measures. , 1992, Nucleic acids research.

[6]  C. E. SHANNON,et al.  A mathematical theory of communication , 1948, MOCO.

[7]  C. Zhang,et al.  A graphic approach to analyzing codon usage in 1562 Escherichia coli protein coding sequences. , 1994, Journal of molecular biology.

[8]  C T Zhang A symmetrical theory of DNA sequences and its applications. , 1997, Journal of theoretical biology.

[9]  A. Lapedes,et al.  Determination of eukaryotic protein coding regions using neural networks and information theory. , 1992, Journal of molecular biology.

[10]  Feng-Biao Guo,et al.  ZCURVE: a new system for recognizing protein-coding genes in bacterial and archaeal genomes. , 2003, Nucleic acids research.

[11]  S. Karlin,et al.  Prediction of complete gene structures in human genomic DNA. , 1997, Journal of molecular biology.

[12]  Ju Wang,et al.  Using a Euclid Distance Discriminant Method to Find Protein Coding Genes in the Yeast Genome , 2002, Comput. Chem..

[13]  R. Guigó,et al.  Evaluation of gene structure prediction programs. , 1996, Genomics.

[14]  T A Thanaraj,et al.  Positional characterisation of false positives from computational prediction of human splice sites. , 2000, Nucleic acids research.

[15]  J W Fickett,et al.  Finding genes by computer: the state of the art. , 1996, Trends in genetics : TIG.

[16]  Michael Ruogu Zhang,et al.  Identification of protein coding regions in the human genome by quadratic discriminant analysis. , 1997, Proceedings of the National Academy of Sciences of the United States of America.

[17]  A. D. McLachlan,et al.  Codon preference and its use in identifying protein coding regions in long DNA sequences , 1982, Nucleic Acids Res..

[18]  J. Hawkins,et al.  A survey on intron and exon lengths. , 1988, Nucleic acids research.

[19]  J. Fickett Recognition of protein coding regions in DNA sequences. , 1982, Nucleic acids research.

[20]  M. Borodovsky,et al.  GeneMark.hmm: new solutions for gene finding. , 1998, Nucleic acids research.

[21]  Chun-Ting Zhang,et al.  Recognizing shorter coding regions of human genes based on the statistics of stop codons. , 2002, Biopolymers.

[22]  C. Zhang,et al.  Recognition of protein coding genes in the yeast genome at better than 95% accuracy based on the Z curve. , 2000, Nucleic acids research.

[23]  Steven Salzberg,et al.  A Decision Tree System for Finding Genes in DNA , 1998, J. Comput. Biol..