A novel method of recognizing short coding sequences of human genes

In this paper, we present a novel feature representation of DNA sequences based on graphical representation. A support vector machine (SVM) is applied to classify short sequences of human genes as coding or non-coding. To compensate for the lack of negative sample sequences, we propose an improved self-similar map method. We divide the dataset into several groups according to GC content and classify the sequences in each group separately. The results show that the proposed method obtains higher accuracy with fewer parameters.
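The GC-content grouping step can be illustrated with a minimal sketch. The abstract does not state how many groups are used or where the group boundaries lie, so the function names `gc_content` and `group_by_gc` and the boundaries 0.45 and 0.55 are assumptions chosen only for illustration, not the values used in the paper.

```python
from collections import defaultdict


def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def group_by_gc(seqs, boundaries=(0.45, 0.55)):
    """Partition sequences into GC-content groups.

    `boundaries` are hypothetical cut points (assumed here, not taken
    from the paper). Group index k means the GC content is greater than
    or equal to exactly k of the boundaries.
    """
    groups = defaultdict(list)
    for s in seqs:
        gc = gc_content(s)
        idx = sum(gc >= b for b in boundaries)
        groups[idx].append(s)
    return dict(groups)


if __name__ == "__main__":
    demo = ["AAAA", "ATGC", "GGCC"]
    for idx, members in sorted(group_by_gc(demo).items()):
        print(idx, members)
```

Under this scheme, each group would then be given to its own classifier, so that differences in base composition between GC-poor and GC-rich regions do not dominate the coding/non-coding decision.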
