Predicting cassette exons using transductive learning approaches

Recent advances in biotechnology have produced large volumes of genomic and proteomic data, prompting the development of numerous in silico annotation methods, including supervised machine learning approaches. Such algorithms, however, require large amounts of labeled data for training, and in practice labeled data is often scarce because it is difficult to obtain. Semi-supervised machine learning is therefore attractive: a classifier trained on a limited amount of labeled data can be improved by exploiting large amounts of unlabeled data. In this work, we focus on transductive learning, a special case of semi-supervised learning. An inductive semi-supervised algorithm builds a model intended to generalize well to new, unseen (test) instances. A transductive algorithm, in contrast, has access during the training phase to the very (test) instances that need to be classified, and can use these points to find a better separating function. Cassette exon identification is a suitable application for transductive learning because the goal is to annotate an already sequenced genome for which only a limited amount of labeled data is available, rather than to learn a classifier for use with future data. We study the applicability of three popular transductive techniques, and their compatibility with various kernels, to the binary DNA classification problem of cassette exon identification. The results of our experiments suggest that transductive learning is a useful approach for assisting genome annotation.
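To make the transductive setting concrete, the following minimal sketch classifies unlabeled DNA sequences that are already present at training time. The abstract does not name the three techniques or the kernels studied, so the choices here are assumptions made purely for illustration: a cosine-normalized k-mer (spectrum) kernel as the similarity measure and a basic clamped label propagation scheme as the transductive method. The function names and the toy sequences are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import sqrt


def spectrum_features(seq, k=2):
    """Count every length-k substring (k-mer) of a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k=2):
    """Cosine-normalized dot product of the k-mer count vectors of a and b."""
    fa, fb = spectrum_features(a, k), spectrum_features(b, k)
    dot = sum(fa[m] * fb[m] for m in fa)
    norm = sqrt(sum(v * v for v in fa.values()) * sum(v * v for v in fb.values()))
    return dot / norm if norm else 0.0


def label_propagation(seqs, labels, k=2, iterations=50):
    """Propagate labels over a kernel-weighted graph of all sequences.

    labels holds +1 or -1 for labeled sequences and None for the unlabeled
    (test) sequences. Returns one score per sequence; the sign of a score
    is the predicted class.
    """
    n = len(seqs)
    # Edge weights between all pairs, labeled and unlabeled alike.
    weights = [[spectrum_kernel(seqs[i], seqs[j], k) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    scores = [float(y) if y is not None else 0.0 for y in labels]
    for _ in range(iterations):
        updated = []
        for i in range(n):
            if labels[i] is not None:
                # Clamp labeled points to their known label.
                updated.append(float(labels[i]))
            else:
                # Unlabeled points take the weighted mean of their neighbors.
                total = sum(weights[i])
                mean = sum(weights[i][j] * scores[j] for j in range(n)) / total if total else 0.0
                updated.append(mean)
        scores = updated
    return scores


# Hypothetical toy data: two labeled sequences and two unlabeled test sequences.
seqs = ["ACACACAC", "GTGTGTGT", "ACACACAT", "GTGTGTGA"]
labels = [1, -1, None, None]
scores = label_propagation(seqs, labels)
predictions = [1 if s > 0 else -1 for s in scores]
print(predictions)
```

The unlabeled sequences take part in building the similarity graph, which is what distinguishes this setting from training an inductive classifier on the labeled sequences alone.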
