Accurate identification of alternatively spliced exons using support vector machine

MOTIVATION Alternative splicing is a major component of the regulatory action on mammalian transcriptomes. It is estimated that over half of all human genes have more than one splice variant. Previous studies have shown that alternatively spliced exons possess several features that distinguish them from constitutively spliced ones. Recently, we demonstrated that such features can be used to distinguish alternative from constitutive exons. In the current study, we used advanced machine learning methods to generate a robust classifier of alternative exons.

RESULTS We extracted several hundred local sequence features of constitutive as well as alternative exons. Using feature selection methods, we found seven attributes that are dominant for the classification task. Several less informative features slightly increase the performance of the classifier. The classifier achieves a true positive rate of 50% at a false positive rate of 0.5%. This result enables one to reliably identify alternatively spliced exons in exon databases that are believed to be dominated by constitutive exons.
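To make the reported operating point concrete, the following minimal Python sketch shows (a) how a true positive rate at a fixed false positive rate can be read off from classifier scores, and (b) why a 0.5% false positive rate matters when constitutive exons dominate a database. The function names, the toy scores, and the 10% prevalence of alternative exons are illustrative assumptions, not values or code from the study.

```python
def tpr_at_fpr(pos_scores, neg_scores, max_fpr):
    """Best true positive rate reachable by thresholding scores
    while keeping the false positive rate <= max_fpr."""
    neg_sorted = sorted(neg_scores, reverse=True)
    # Number of negatives (constitutive exons) we may misclassify.
    allowed_fp = int(max_fpr * len(neg_sorted))
    if allowed_fp >= len(neg_sorted):
        threshold = float("-inf")
    else:
        # Flag only scores strictly above this negative's score,
        # so at most `allowed_fp` negatives are flagged.
        threshold = neg_sorted[allowed_fp]
    true_pos = sum(1 for s in pos_scores if s > threshold)
    return true_pos / len(pos_scores)


def precision_at_operating_point(tpr, fpr, prevalence):
    """Fraction of exons flagged as alternative that truly are
    alternative, given the class prevalence in the database."""
    flagged_true = tpr * prevalence
    flagged_false = fpr * (1.0 - prevalence)
    return flagged_true / (flagged_true + flagged_false)


# Toy scores (hypothetical): higher means "more likely alternative".
alt_scores = [0.9, 0.8, 0.3, 0.1]
const_scores = [0.7, 0.2, 0.1, 0.05]
toy_tpr = tpr_at_fpr(alt_scores, const_scores, max_fpr=0.25)

# Reported operating point, with an ASSUMED 10% share of alternative exons.
precision = precision_at_operating_point(tpr=0.5, fpr=0.005, prevalence=0.10)
print(f"toy TPR at FPR <= 25%: {toy_tpr:.2f}")
print(f"precision at assumed 10% prevalence: {precision:.3f}")
```

Under the assumed 10% prevalence, roughly nine in ten flagged exons would be truly alternative; the figure drops as the assumed prevalence falls, which is why a very low false positive rate is needed for databases dominated by constitutive exons.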
