Comparative Analyses between Retained Introns and Constitutively Spliced Introns in Arabidopsis thaliana Using Random Forest and Support Vector Machine

Alternative splicing is an important mode of post-transcriptional pre-mRNA modification. It allows many distinct mature mRNA transcripts to be produced from a single gene through the use of different splice sites. In plants such as Arabidopsis thaliana, the most common type of alternative splicing is intron retention. Many past studies have focused on the positional distribution of retained introns (RIs) among genic regions and on the regulation of their expression, whereas few have used machine learning approaches to systematically classify RIs against constitutively spliced introns (CSIs). We used random forest and a support vector machine (SVM) with a radial basis function (RBF) kernel to differentiate these two types of introns in Arabidopsis. We obtained a high-quality data set by comparing the intron coordinates of all annotated mRNAs in TAIR10. To distinguish RIs from CSIs, we investigated the characteristics unique to RIs and extracted 37 quantitative features: local and global nucleotide sequence features of introns, frequent motifs, the signal strength of splice sites, and the similarity between intron sequences and their flanking regions. We demonstrated that our proposed feature extraction approach classified RIs and CSIs more accurately than four other approaches. The penalty parameter C and the RBF kernel parameter of the SVM were optimized with a particle swarm optimization algorithm (PSOSVM). The classifiers achieved F-measures of 80.8% (random forest) and 77.4% (PSOSVM). In addition to the basic sequence features and positional distribution characteristics of RIs, our feature extraction approach predicted putative regulatory motifs involved in intron splicing. This study should facilitate a better understanding of the mechanisms underlying intron retention.
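For readers unfamiliar with the two quantities named above, the following minimal Python sketch shows the standard textbook definitions of the RBF kernel and the F-measure. It is an illustration only, not the authors' implementation: the function names `rbf_kernel` and `f_measure`, the parameter name `gamma` for the RBF kernel parameter, and the example counts are all assumptions made for this sketch.

```python
import math


def rbf_kernel(x, y, gamma):
    """Standard RBF (Gaussian) kernel: exp(-gamma * ||x - y||^2).

    x and y are equal-length sequences of numbers (feature vectors);
    gamma is the kernel width parameter (name assumed for this sketch).
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def f_measure(tp, fp, fn):
    """F-measure: harmonic mean of precision and recall.

    tp, fp, fn are counts of true positives, false positives and
    false negatives. Returns 0.0 when there are no true positives.
    """
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, chosen only to illustrate the calculation.
example_f = f_measure(tp=8, fp=2, fn=2)
```

In a parameter search such as the particle swarm optimization mentioned in the abstract, each candidate pair of C and the kernel parameter would typically be scored by a cross-validated metric such as this F-measure; the exact objective used by the authors is not stated here.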
