A novel splice site prediction method using support vector machine

We present a novel classification method for splice site prediction using a support vector machine (SVM). The method first represents each input sequence by sequence-based features: the distribution of tri-nucleotides, and the conserved features surrounding the splice site as characterized by a Markov model. An F-score based feature selection method then selects the most informative features to improve performance. Finally, an SVM classifies the splice sites using the selected features. Experimental results show that this method improves splice site prediction accuracy and outperforms existing methods such as MM1-SVM and Reduced MM1-SVM, among others.
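The first two stages of this pipeline can be illustrated with a minimal sketch. It rests on assumptions, since the abstract gives no formulas: the tri-nucleotide features are taken to be relative frequencies over overlapping windows, and the F-score is the definition commonly used for feature selection with SVMs (squared distance of each class mean from the overall mean, divided by the sum of the within-class sample variances). The function names are hypothetical, and the Markov-model features and the SVM itself are omitted.

```python
from itertools import product
from statistics import mean, variance

ALPHABET = "ACGT"
# All 64 possible tri-nucleotides, in a fixed order (one feature each).
TRINUCLEOTIDES = ["".join(p) for p in product(ALPHABET, repeat=3)]


def trinucleotide_frequencies(seq):
    """Relative frequency of each tri-nucleotide over overlapping windows."""
    seq = seq.upper()
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    total = 0
    for i in range(len(seq) - 2):
        tri = seq[i:i + 3]
        if tri in counts:  # skip windows containing N or other symbols
            counts[tri] += 1
            total += 1
    return [counts[t] / total if total else 0.0 for t in TRINUCLEOTIDES]


def f_scores(positives, negatives):
    """F-score of every feature; each class needs at least two samples."""
    scores = []
    for j in range(len(positives[0])):
        pos = [row[j] for row in positives]
        neg = [row[j] for row in negatives]
        overall = mean(pos + neg)  # mean over the whole data set
        numerator = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
        # statistics.variance is the sample variance (divides by n - 1).
        denominator = variance(pos) + variance(neg)
        scores.append(numerator / denominator if denominator > 0 else 0.0)
    return scores


def select_top_features(scores, k):
    """Indices of the k features with the highest F-score."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Toy sequences invented for illustration; not real splice site data.
    true_sites = ["AAGGTAAGTC", "CAGGTGAGTA", "AAGGTAAGTA"]
    false_sites = ["ACGTTGCATC", "TTGCACGTTA", "CCATGCAATG"]
    pos = [trinucleotide_frequencies(s) for s in true_sites]
    neg = [trinucleotide_frequencies(s) for s in false_sites]
    top = select_top_features(f_scores(pos, neg), 5)
    print([TRINUCLEOTIDES[j] for j in top])
```

The selected columns would then be passed to an SVM trainer of the reader's choice. Ranking by F-score treats each feature independently, so it does not capture features that are informative only in combination.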
