Prediction of Protein Coding Regions by Support Vector Machine

With the exponential growth of genomic sequence data, there is an increasing demand for accurate identification of protein coding regions in genomic sequences. Although computational methods for identifying protein coding regions have made considerable progress in recent years, their performance and efficiency still need to be improved. This paper proposes a novel method to predict the position of coding regions. First, a support vector machine is used as a classifier to recognize the first nucleotide of a codon in a coding region. Then, the output values of the classifier are analyzed with the Short-Time Fourier Transform, and the position of coding regions is determined accurately from differences in their time-frequency characteristics. The algorithm not only predicts coding regions but also identifies the first nucleotide of each codon within them, which is important for accurate translation into a protein sequence. Simulation results show that the proposed method is more effective for coding region prediction than existing coding region discovery tools.
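The second stage can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the classifier output is high at every third position inside a coding region, so a short-time Fourier analysis shows a strong component at frequency 1/3 there and only noise elsewhere. The classifier itself is not modeled; `simulate_scores` is a hypothetical stand-in for real SVM decision values, and the window length, step, and threshold are illustrative choices rather than values from the paper.

```python
import cmath
import math
import random


def simulate_scores(length=900, start=300, end=600, frame=1, seed=0):
    """Hypothetical stand-in for SVM decision values (not real classifier output).

    Inside [start, end) the score is near +1 at positions with i % 3 == frame
    (assumed first codon positions) and near -1 elsewhere; outside it is noise.
    """
    rng = random.Random(seed)
    scores = []
    for i in range(length):
        if start <= i < end:
            base = 1.0 if i % 3 == frame else -1.0
            scores.append(base + rng.uniform(-0.1, 0.1))
        else:
            scores.append(rng.uniform(-1.0, 1.0))
    return scores


def period3_profile(scores, window=99, step=3):
    """Slide a rectangular window and evaluate the DFT at frequency 1/3.

    The window length is a multiple of 3, so a constant offset in the scores
    contributes nothing at this frequency. The phase of the coefficient
    indicates which of the three frames carries the peaks.
    """
    centers, powers, frames = [], [], []
    for s in range(0, len(scores) - window + 1, step):
        coeff = sum(scores[s + n] * cmath.exp(-2j * math.pi * (s + n) / 3)
                    for n in range(window))
        centers.append(s + window // 2)
        powers.append(abs(coeff) ** 2 / window)
        frames.append(round(-cmath.phase(coeff) * 3 / (2 * math.pi)) % 3)
    return centers, powers, frames


def find_regions(centers, powers, threshold):
    """Return (first_center, last_center) for each run of windows above threshold."""
    regions, begin, prev = [], None, None
    for c, p in zip(centers, powers):
        if p >= threshold and begin is None:
            begin = c
        elif p < threshold and begin is not None:
            regions.append((begin, prev))
            begin = None
        prev = c
    if begin is not None:
        regions.append((begin, prev))
    return regions


scores = simulate_scores()
centers, powers, frames = period3_profile(scores)
regions = find_regions(centers, powers, threshold=10.0)  # threshold is an assumption
print("candidate coding regions (window centers):", regions)
print("estimated frame in the middle window:", frames[len(frames) // 2])
```

Reporting window centers rather than window starts keeps the detected boundaries close to the true region edges, at the cost of a resolution limited by the window length. A shorter window localizes boundaries better but makes the period-3 peak noisier.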
