Improved identification of conserved cassette exons using Bayesian networks

Background: Alternative splicing is a major contributor to the diversity of eukaryotic transcriptomes and proteomes. Currently, large-scale detection of alternative splicing using expressed sequence tags (ESTs) or microarrays does not capture all alternative splicing events. Moreover, for many species genomic data is being produced at a far greater rate than the corresponding transcript data, so in silico methods of predicting alternative splicing have to be improved.

Results: Here, we show that Bayesian networks (BNs) allow accurate prediction of evolutionarily conserved exon skipping events. At a stringent false positive rate of 0.5%, our BN achieves an improved true positive rate of 61%, compared to a previously reported 50% on the same dataset using support vector machines (SVMs). The improvement results from incorporating several novel discriminative features, such as intronic splicing regulatory elements. Features related to mRNA secondary structure increase the prediction performance, corroborating previous findings that secondary structures are important for exon recognition. Random labelling tests rule out overfitting. Cross-validation on another dataset confirms the increased performance. When using the same dataset and the same set of features, the BN matches the performance of an SVM in earlier literature. Remarkably, about half of the exons that are labelled constitutive but are assigned a high probability of being alternative by the BN are in fact alternative exons according to the latest EST data. Finally, we predict exon skipping without using conservation-based features and achieve a true positive rate of 29% at a false positive rate of 0.5%.

Conclusion: BNs can be used to achieve accurate identification of alternative exons and provide clues about possible dependencies between relevant features. The near-identical performance of the BN and the SVM when using the same features shows that good classification depends more on the features than on the choice of classifier. Conservation-based features continue to be the most informative, and hence distinguishing alternative exons from constitutive ones without conservation-based features remains a challenging problem.
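The operating point quoted above (true positive rate at a fixed false positive rate of 0.5%) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `tpr_at_fpr`, the label convention (1 for alternative, 0 for constitutive), and the toy scores are all assumptions made for illustration. The idea is to pick the score threshold so that at most the permitted fraction of constitutive exons is called alternative, then measure the fraction of alternative exons above that threshold.

```python
import math


def tpr_at_fpr(scores, labels, max_fpr=0.005):
    """Return the true positive rate at a threshold whose false positive
    rate does not exceed max_fpr.

    scores: classifier outputs, higher means "more likely alternative".
    labels: 1 for alternative exons, 0 for constitutive exons (assumed convention).
    """
    # Scores of the negatives (constitutive exons), highest first.
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not negatives or not positives:
        raise ValueError("need at least one positive and one negative example")

    # Number of false positives we are allowed to make.
    allowed_fp = int(max_fpr * len(negatives))

    if allowed_fp >= len(negatives):
        # Every negative may be misclassified, so accept everything.
        threshold = -math.inf
    else:
        # Calling "alternative" only for scores strictly above this value
        # yields at most allowed_fp false positives.
        threshold = negatives[allowed_fp]

    true_positives = sum(1 for s in positives if s > threshold)
    return true_positives / len(positives)


# Hypothetical toy data: four constitutive and four alternative exons.
toy_scores = [0.1, 0.2, 0.3, 0.9, 0.95, 0.5, 0.25, 0.05]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(tpr_at_fpr(toy_scores, toy_labels, max_fpr=0.25))
```

With a real dataset, `max_fpr=0.005` would correspond to the 0.5% false positive rate used in the abstract. Ties at the threshold are resolved conservatively here, so the realised false positive rate can be lower than the limit.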
