Segmentation of short human exons based on spectral features of double curves

This paper presents a segmentation method based on spectral analysis that locates the borders between short protein-coding regions and non-coding regions. We formulate a double curve representation of a DNA sequence and apply a local three-codon measurement to its discrete Fourier spectral features at frequency 1/3 to identify short protein-coding regions. The proposed method requires no prior knowledge of the DNA data. In our simulations, it identifies short coding regions markedly more accurately than the compared methods that analyse DNA sequences directly.
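To make the idea of a spectral feature at frequency 1/3 concrete, the following is a minimal sketch, not the method of this paper. The double curve representation is not defined in the abstract, so the sketch substitutes the common four binary indicator sequences (one per nucleotide). It also assumes that a "local three-codon measurement" corresponds to a sliding window of nine nucleotides. The function name `period3_power` and its parameters are hypothetical.

```python
import cmath


def period3_power(seq, window=9, step=3):
    """Return (start, power) pairs: DFT power at frequency 1/3 per window.

    Assumption: binary indicator sequences stand in for the paper's
    double curve representation, which the abstract does not define.
    """
    seq = seq.upper()
    # At frequency 1/3 the DFT kernel exp(-2*pi*i*n/3) takes only three values.
    phase = [cmath.exp(-2j * cmath.pi * r / 3) for r in range(3)]
    scores = []
    for start in range(0, len(seq) - window + 1, step):
        total = 0.0
        for base in "ACGT":
            # DFT coefficient of this base's indicator sequence in the window.
            coeff = sum(
                phase[n % 3] for n in range(window) if seq[start + n] == base
            )
            total += abs(coeff) ** 2
        scores.append((start, total))
    return scores


if __name__ == "__main__":
    for start, power in period3_power("ATG" * 6):
        print(start, round(power, 6))
```

A window with strong codon-position bias yields a large value, while a window without period-3 structure (for example a homopolymer run) yields a value near zero. A segmentation step would then look for positions where this score changes sharply; how the paper does that is not stated in the abstract.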
