Effective DNA Encoding for Splice Site Prediction Using SVM

Splice site prediction in pre-mRNA is an important task for understanding gene structure and function. Support vector machine (SVM)-based classification is frequently used to predict splice sites because of its classification accuracy. The performance of an SVM depends largely on the DNA encoding method. However, existing encoding approaches do not capture the characteristics of DNA sequences well enough to convey all the information the sequences contain. In this paper, we propose a new DNA encoding method for feature extraction that conveys more information about a DNA sequence. Our encoding method provides density information for each nucleotide along with positional information and chemical properties. An extensive performance study shows that the proposed method outperforms existing encoding methods on several performance criteria, including classification accuracy, sensitivity, specificity, and auROC (area under the receiver operating characteristic curve).
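
To make the idea of combining density, position, and chemical property concrete, the following is a minimal sketch of one possible encoding, not necessarily the exact scheme proposed in the paper. It assumes three binary chemical properties per nucleotide (purine vs. pyrimidine, amino vs. keto, strong vs. weak hydrogen bonding) and defines density as the running frequency of the current nucleotide up to its position. Positional information is carried implicitly by the order of the features. The names `CHEMICAL_PROPERTIES` and `encode_sequence` are hypothetical.

```python
# Illustrative sketch only: the property table and the density definition
# below are assumptions, not the paper's confirmed encoding.

# Assumed chemical property bits: (is_purine, is_amino, is_strong_h_bond)
CHEMICAL_PROPERTIES = {
    "A": (1, 1, 0),  # purine, amino, weak hydrogen bond
    "C": (0, 1, 1),  # pyrimidine, amino, strong hydrogen bond
    "G": (1, 0, 1),  # purine, keto, strong hydrogen bond
    "T": (0, 0, 0),  # pyrimidine, keto, weak hydrogen bond
}


def encode_sequence(seq):
    """Encode a DNA string as a flat list with four values per position:
    three chemical property bits followed by the nucleotide's density."""
    counts = {base: 0 for base in CHEMICAL_PROPERTIES}
    features = []
    for position, base in enumerate(seq.upper(), start=1):
        if base not in CHEMICAL_PROPERTIES:
            raise ValueError(f"unsupported nucleotide {base!r} at position {position}")
        counts[base] += 1
        # Assumed density: occurrences of this base so far / current position.
        density = counts[base] / position
        features.extend(CHEMICAL_PROPERTIES[base])
        features.append(density)
    return features


if __name__ == "__main__":
    # Each fixed-length window around a candidate splice site would yield
    # one feature vector of length 4 * len(window) for the SVM.
    print(encode_sequence("AAC"))
```

Because every window of the same length produces a vector of the same size, the output could be passed directly to an SVM implementation such as LIBSVM.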
